Abstract


This  report  describes  the  design  and  implementation  of  a 
LALR(1) parse table constructor and a parser utilizing the
tables.  The  constructor  is  noteworthy  because  it  allows 
semantic  action  symbols  to  be  embedded  at  near-arbitrary 
points  of  each  production,  and  because  regular  right  parts 
are  allowed  in  productions. 

The  theory  of  LR(k)  parse  table  construction  is 
developed  using  Aho  and  Ullman's  model  involving  pairs  of 
functions  (f,g).  DeRemer's  incremental  approach  (LR(0), 
SLR(1),  and  LALR(1))  is  explained,  still  in  terms  of  the 
(f,g)  model.  The  pieces  of  the  systems  are  described  in 
detail,  including  the  parser,  grammar  normalization,  grammar 
analysis,  table  construction,  and  table  optimization. 

Next  comes  a  discussion  of  the  theory  behind  the 
embedding  of  semantic  action  symbols.  We  relate  our  work  to 
that  of  Lewis  and  Stearns,  in  terms  of  their  'Derived  Symbol 
Polish'  grammars. 
Chapter 1
Introduction 


This thesis describes the design of an efficient LALR(1)
parse  table  constructor,  and  a  parsing  system  employing  the 
tables.  We  have  incorporated  into  the  constructor  several 
novel  features  which  we  feel  to  be  particularly  useful  to 
the  user  of  such  a  system.  These  features  include 

1)  allowing  regular  expressions  in  the  right  hand  sides 
of  productions;  and 

2)  associating  semantic  actions  with  points  of  a 
production  other  than  its  completion. 

The LR(k) grammars constitute the largest known class of
grammars for which bottom-up, deterministic, shift-reduce
parsers using k-symbol lookahead can be automatically
generated. Knuth's original article on the subject [Knuth
1965] discussed two formulations of the LR parsing
algorithm, of which the second can be converted into a
parser constructor algorithm by some simple modifications.
However, the number of 'tables' generated by this algorithm
is quite large, large enough to make the algorithm
impractical even for k = 1. In 1969, first Korenjak
[Korenjak 1969], and then DeRemer [DeRemer 1969, 1971]
described modifications of Knuth's original algorithm which
greatly improved its efficiency. Not long after that,
LaLonde implemented DeRemer's algorithms at the University
of Toronto [LaLonde 1971]. Since that time, there have been
several implementations of LR parser generator systems,
including Pager's [Pager 1973a], YACC [Johnson 1974], and
lately, LABK [Adams and White 1975].

The  goals  of  our  system  are: 

1)  The  constructor  should  be  easy  to  understand  and 
maintain;

2)  The  constructor  should  be  simple  to  use; 

3) The constructor should provide its user with as much
'useful' information about the grammar being processed as is
possible. Quite often, the user has just produced the
grammar and is attempting to 'debug' it, in order to ensure
that it really does define the desired language. We are
thus assuming that a constructor is utilized as much for
deriving information about a grammar as it is for actually
producing parse tables;

4)  The  constructor  should  be  reliable  and  robust; 
5) The constructor should produce parse tables from a
variety  of  grammatical  presentation  forms; 


6)  The  parsing  system  employing  the  parse  tables  should 
consist  of  clear,  minimal  code,  and  should  be  easily 
modifiable  to  fit  the  user’s  changing  needs. 


1.1: Semantics

Syntactic  analysis  is  mainly  performed  in  order  to 
derive  semantic  information.  Most  bottom-up  parsing 
systems,  such  as  LaLonde's  and  YACC,  allow  semantic 
processing  to  be  associated  with  the  completion  of  each 
production. Top-down systems, on the other hand, are much
more flexible than traditional bottom-up ones, because they
allow semantics at arbitrary points in productions; that is,
'semantic actions' can be placed anywhere in the right hand
side of a production that a grammar symbol can be. Knuth
states that "... top-down analysis is popular chiefly
because it lends itself so conveniently to semantic
extensions" [Knuth 1971, p. 79].

Our  system  allows  semantic  action  symbols  to  be 
associated  with  points  in  a  production  other  than  just  the 
completion  of  the  production.  These  points  are  not 
completely  arbitrary,  as  we  shall  see,  but  we  will  prove 
that  our  system  can  correctly  process  any  grammar  (with 
arbitrary  semantics)  that  an  LL(1)  parser  generator  system 
can process. Thus, our system provides the syntactic
recognition power of LALR(1), and includes semantics much as
an LL(1) system would.

The system described here has been implemented at the
University of Toronto. Both the constructor and the parsing
system are written in SUE.8, a compatible subset of the SUE
System Language [Clark and Ham 1974].

The constructor has been used to produce parse tables
for SUE.8.


1.2: Organization of this Report

Chapter  2  discusses  the  theory  of  LR  parsing  and  parse 
table construction, using Aho and Ullman's model [Aho and
Ullman 1972a,b; 1973a,b]. The last section of Chapter 2,
Section  2.4,  describes  the  theoretical  basis  for  association 
of  semantics  with  syntax  —  syntax-directed  translation 
schemata. 

Chapter  3  describes  the  organization  of  a  parsing  system 
which  employs  the  tables  constructed  by  the  constructor. 

Each  piece  of  the  system  is  briefly  described,  and  a 
hierarchical  view  of  the  system  is  given. 
Chapter 4 constitutes the main body of the paper, and is
a  description  of  the  design  of  the  constructor.  The  chapter 
is  divided  into  six  sections,  mirroring  the  divisions  of  the 
constructor.  Section  4.1  discusses  grammatical  input  and 
normalization,  and  Section  4.2  discusses  grammatical 
analysis.  The  analysis  phase  derives  several  items  of 
interest  to  the  constructor  user.  Section  4.3  describes  the 
incremental approach to parser construction (LR(0),
SLR(1), and LALR(1)) in terms of Aho and Ullman's model.
Section  4.4  discusses  the  inclusion  of  semantics  in  the 
construction  algorithm.  We  also  characterize  the  class  of 
translations  we  can  perform,  by  relating  our  methods  to 
those  of  Lewis  and  Stearns,  who  have  proven  that  systems  of 
this  type  can  translate  the  class  of  grammars  known  as  the 
'derived symbol Polish' grammars [Lewis and Stearns 1968].
We  prove  that  we  can  process  all  LL(1)  grammars  with 
semantics  at  arbitrary  points  of  productions.  Section  4.5 
discusses  the  table  optimization  phase  of  the  constructor, 
and  Section  4.6  describes  the  communication  between  the 
constructor  and  the  parsing  system,  as  well  as  the  output  of 
the  information  derived  from  the  grammar  for  the  user's 
benefit.

Chapter 5 summarizes our work and presents several interesting
topics deserving future study.
Chapter 2

Background  and  Notation 


In this chapter we discuss the theory and use of LR
grammars and parse tables. In Section 2.2, we discuss the
use of parse tables in the LR parsing process. The
canonical  method  of  table  construction  is  described  in 
Section  2.3.  Before  we  discuss  the  parse  tables,  however, 
we  review  some  basic  definitions  in  Section  2.1. 


2.1:  Preliminary  Definitions 

We denote the null string by the symbol e throughout.
If X and Y are sets of symbols, then we define XY to be the
set {xy | x ∈ X and y ∈ Y}, where xy denotes the
concatenation of symbols x and y. If X is a set of symbols,
then

X power 0 = {e},
X power (i+1) = (X power i)X, for i ≥ 0, and
X* denotes the union of X power i, for all i ≥ 0.

More intuitively, X* is the set of all strings formed from
symbols in X; X* includes the null string e.

A (context free) grammar is a quadruple
G = <Vn, Vt, S, P>, where

Vn is a finite set of symbols, called nonterminal
symbols;

Vt is a finite set of symbols, called terminal symbols
(Vn and Vt are disjoint);

S ∈ Vn is the start or goal symbol; and

P, the production set, is a finite subset of the
Cartesian product of Vn and V*, where V is the union
of Vn and Vt; each element (A,z) of P is denoted

A -> z

where A ∈ Vn and z ∈ V*.

We will order the productions in the production set in
what appears to be an arbitrary manner, by assigning them
integers 1,2,...,|P|, where |P| is the number of
productions. We generically denote the pth production as

(p) A -> X1 X2 ... X(#p)

where each Xi for 0<i≤#p is a grammar symbol, and #p≥0 is
the number of symbols in the right hand side of production
p. The ordering of the productions imposed by the
constructor is not at all arbitrary, as shall be seen in
Section 4.2.

We use...        to denote...

A,B,C...         nonterminal symbols
[...]            semantic action symbols
...X,Y,Z         single nonterminal or terminal symbols
a,b,c...         terminal symbols
p,q,r...         strings of terminal symbols
...x,y,z         strings of nonterminal and terminal symbols


We define the operation =>, read as 'immediately
derives', as u => v if and only if there exist strings
x,y,z ∈ V* and a production A -> y in P, such that u = xAz
and v = xyz. Intuitively, any nonterminal symbol A in the
string of symbols composing u may be replaced by the right
hand side of one of A's defining productions. We use the
symbol =>rm, 'immediately derives rightmost', if A is the
rightmost nonterminal symbol in u. We may next define =>*,
the reflexive and transitive closure of =>, read as
'derives': u =>* v if and only if there exist strings x1,
x2, ..., xn, n > 0, such that u = x1 => x2 => ... => xn = v.


In  this  paper,  we  will  almost  always  be  using  the 
rightmost  derivation  in  our  examples  and  discussion;  we 
therefore  choose  the  rightmost  derivation  as  our  canonical 
derivation, and drop the 'rm' from the symbol whenever the
context  clearly  implies  it. 

An A-sentential form of a grammar G is any string
derivable from A: it is a member of the set {x ∈ V* |
A =>* x}. S-sentential forms, where S ∈ Vn is the start
symbol of the grammar G, are simply called sentential forms.

The language generated by the grammar G is the set of
terminal sentential forms: L(G) = {q ∈ Vt* | S =>* q}.

If x,y,z ∈ V*, with x = yz, then we call y a prefix of
x, and z a suffix.

The terminal prefixes of length less than or equal to k
of a string z ∈ V* are given by the set

FIRST_k(z) = {q ∈ Vt* | z =>* qy, and either |q| = k, or
(y = e and |q| < k)}, where we denote by |q| the number of
symbols in q. We include e in FIRST_k(z) if z =>* e. We
will deal for the most part with the case k ≤ 1. We will
therefore drop the 'k' from the set name.
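
As an illustration of the definition for k = 1, the following is a minimal sketch of how FIRST might be computed by fixed-point iteration. The grammar encoding (a dictionary from each nonterminal to a list of right hand sides) and the names EPS, first_of_string and compute_first are assumptions made for this example; they are not taken from the constructor described in this report.

```python
# Sketch only: FIRST sets for k = 1 by fixed-point iteration.
EPS = "e"  # stands for the null string; assumes no terminal is spelled "e"

def first_of_string(z, first):
    """FIRST of a string z of grammar symbols, given FIRST of each nonterminal."""
    result = set()
    for sym in z:
        if sym not in first:            # sym is a terminal symbol
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:       # sym cannot vanish, so stop here
            return result
    result.add(EPS)                     # every symbol of z can derive e
    return result

def compute_first(grammar):
    """grammar maps each nonterminal to a list of right hand sides (tuples)."""
    first = {a: set() for a in grammar}
    changed = True
    while changed:                      # repeat until no set grows
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

# Example grammar with productions S -> SaSb and S -> e.
grammar = {"S": [("S", "a", "S", "b"), ()]}
first = compute_first(grammar)
```

For this grammar the sketch yields FIRST(S) = {a, e}.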

We will have occasion to refer in the sequel to a
special subset of FIRST_k, called e_free_FIRST_k:
e_free_FIRST_k(z) = {q ∈ Vt* | q ∈ FIRST_k(z) and the last
step in the (canonical) derivation of q from z, if it
exists, does not use a production of the form A -> e}. The
qualifier 'if it exists' in the definition of e_free_FIRST_k
above refers to the fact that z may be a terminal string, in
which case there is no derivation. Note that, in general,
e_free_FIRST_k(z) ≠ FIRST_k(z) - {e}. For example, for the
grammar with productions S -> SaSb and S -> e, we have
FIRST(S) = {a,e}, while e_free_FIRST_k(S) is empty [Aho and
Ullman 1973b].

For A ∈ Vn, we define the FOLLOW set as FOLLOW(A) =
{a ∈ Vt* | S =>* xAy, for some x,y ∈ V*, |a| ≤ 1, and
a ∈ FIRST(y)}. In other words, FOLLOW(A) gives the set of
terminal symbols which may follow nonterminal symbol A in a
sentential form. We include e in FOLLOW(A) if A can be the
rightmost symbol of a sentential form.
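
The FOLLOW definition can be sketched in the same style. The code below is an illustrative fixed-point computation, not the constructor's own algorithm; the grammar encoding and all function names are assumptions of this example.

```python
# Sketch only: FOLLOW sets (lookahead length at most 1) by fixed-point iteration.
EPS = "e"  # stands for the null string; assumes no terminal is spelled "e"

def first_of_string(z, first):
    result = set()
    for sym in z:
        if sym not in first:            # terminal symbol
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def compute_first(grammar):
    first = {a: set() for a in grammar}
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

def compute_follow(grammar, start):
    first = compute_first(grammar)
    follow = {a: set() for a in grammar}
    follow[start].add(EPS)              # S alone is a sentential form
    changed = True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, sym in enumerate(rhs):
                    if sym not in grammar:      # only nonterminals have FOLLOW
                        continue
                    tail = first_of_string(rhs[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:             # sym can end an a-sentential form
                        new |= follow[a]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow

# Example grammar with productions S -> SaSb and S -> e.
follow = compute_follow({"S": [("S", "a", "S", "b"), ()]}, "S")
```

In the example, the first S of SaSb is followed by a, the second by b, and S itself can be rightmost, so FOLLOW(S) = {a, b, e}.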

We  now  turn  our  attention  to  the  wealth  of  terminology 
developed  in  LR  parsing. 

For  a  general  (context  free)  grammar,  a  parse  of  a 
sentence  in  the  language  generated  by  the  grammar  is  an 
indication  of  how  the  sentence  was  formed.  In  LR 
terminology,  the  canonical  parse  of  a  sentence  is  the 
reverse  of  the  canonical  derivation  of  the  sentence. 
Formally, suppose S = x0 =>rm x1 =>rm ... =>rm xn ∈ Vt*,
xi = yiAizi, and x(i+1) = yiwizi, for 0 ≤ i < n, where
yi,wi,zi ∈ V*, Ai ∈ Vn, and each pi = Ai -> wi is a
production in P. Then the canonical derivation is
d = p0p1...p(n-1), and the canonical parse is p(n-1)...p1p0.
A parser for a grammar G is a machine which accepts as input
strings of symbols and outputs for each string either the
canonical parse, if the string is in L(G), or an error
designator, if the string is not in L(G).

The  augmented  grammar  associated  with  a  grammar 
G  =  <Vn,  Vt,  S,  P>  is  the  grammar  G'  =  <Vn',  Vt,  S',  P'>, 
derived  from  G  by  adding  to  Vn  a  new  nonterminal  symbol  S', 
making  it  the  start  symbol  of  G',  and  adding  to  P  a  new 
production S' -> S. Thus, to derive G' we set Vn' = Vn
union {S'}, and P' = P union {S' -> S}.


2.2:  LR  Grammars  and  Parsing 

The  concept  of  LR  grammars  was  first  introduced  by 
Donald  Knuth  in  1965.  His  classic  paper  "On  the  Translation 
of  Languages  from  Left  to  Right"  [Knuth  1965]  was  the 
beginning  of  a  great  deal  of  research  into  languages  whose 
grammars  allow  automatic  construction  of  deterministic 
parsers  performing  in  a  left  to  right  manner,  in  an  amount 
of  time  (and  space)  which  is  linearly  proportional  to  the 
length  of  the  input  string.  The  acronym  'LR(k)'  originally 
stood for 'translatable from left to right with bound k',
but  has  since  been  re-defined  as  'translatable  from  left  to 
right,  using  right  reductions,  with  at  most  k-symbol 
lookahead'  [Knuth  1971].  This  re-definition  was  required 
when  LL(k)  grammars  were  conceived  ([Lewis  and  Stearns 
1968]). The current definition mentions 'reductions',
implying a bottom-up parsing method, in order to distinguish
LR from LL, which reads as 'translatable from left to right,
using left productions, with at most k-symbol lookahead'; LL
is a top-down parsing method.

The LR parsing process consists of shifting input
symbols  onto  a  stack  until  the  stack  contains  symbols 
representing  the  complete  right  hand  side  of  some 
production;  the  stack  is  then  reduced  by  replacing  the 
symbols  of  the  production  just  recognized  by  the  nonterminal 
symbol  defined  by  the  production.  The  actions  of  shifting 
and  reducing  continue  until  either  an  error  is  detected,  or 
the  string  is  accepted.  If  the  input  string  is  in  the 
language,  at  some  point  the  parser  will  have  read  the  entire 
string  and  only  the  start  symbol  will  be  left  on  the  stack, 
at  which  point  the  parser  accepts  the  input  string. 

The LR grammars define the class of shift-reduce
parsable grammars and contain the operator precedence
grammars [Floyd 1963], the simple precedence grammars [Wirth
and Weber 1966], and the mixed strategy precedence grammars
[McKeeman, Horning and Wortman 1970]. The LR parsing
technique is unique among shift-reduce techniques in that it
detects errors in the input string at the earliest possible
opportunity in a left to right scan: before shifting the
first symbol that cannot follow the input string seen so
far.


A grammar G is LR(1) if the three conditions

1) S =>*rm xAz =>rm xyz
2) S =>*rm uBw =>rm xyv
and 3) FIRST(v) = FIRST(z)

taken together imply that x = u, A = B, and v = w; i.e.,
that xAv = uBw.

This definition formalizes the intuitive notion that among
all strings having prefix xyFIRST(z), if at least one string
was canonically derived by replacing an A after x with y,
then all were.

We have given a definition of LR(1) grammars; the
definition is easily extended to LR(k) by modifying
condition three above to use the terminal prefixes of length
k or less: FIRST_k(v) = FIRST_k(z).

Another definition which we will use in Chapter 4 is as
follows: a grammar G is LR(1) if the parse table
construction method to be described succeeds in building
parse tables for G.

There are several different presentations of LR(k)
parsing methods in the literature. Each presentation uses a
large amount of different terminology. Our presentation
follows, for the most part, that of [Aho and Ullman 1972a,b;
1973a,b].
An LR(k) parser operates by utilizing a set of pairs of
functions (f,g). Each such pair is called a table. The
function f is termed the parsing action function; it maps
lookahead strings in Vt* of length k into the parsing action
set {shift, error, accept, reduce 0, reduce 1, ..., reduce
|P|}. f is applied to the first k symbols of the remaining
input string to determine whether the parser should perform
a shift action, a reduce action, or halt due to erroneous
input or because the string has been accepted. The function
g is termed the goto function, and determines the next table
to use after a parsing action has been performed. g maps V
into the set of tables and an error designator.

The parser uses the tables in conjunction with a parse
stack. The parse stack, at any one moment during the
parsing process, contains a current history of the parse
attempt. The stack contains a sequence of tables T0, T1,
..., Tn. These tables correspond to grammar symbols on the
right hand sides of productions. Initially, the stack
contains only the initial table, T0. Let us denote the
current table (the one on top of the stack) by Tn, and let a
be the next symbol in the input string q; then a parsing
step is accomplished as follows:

1) compute the lookahead string u = FIRST_k(q).

2) use the parsing action function f of the current
   table Tn to determine the parsing action for this
   step:

   2a) if f(u) = shift, the parser performs the shift
       action by reading the next symbol a from the
       input string q and stacking g(a), which is the
       next table to use (we may note that if k = 1,
       then a = u).

   2b) if f(u) = reduce i, where i is the index of
       production Ai -> wi in P, then the parser pops
       from the stack one table for each grammar
       symbol in the right hand side of the ith
       production, wi. The parser then stacks gm(Ai),
       where Tm is now the uppermost table on the
       stack and gm is its goto function. Intuitively,
       when the parser pops the stack it is discarding
       the details that were saved while recognizing
       wi; likewise, in stacking gm(Ai) the parser is
       saving the table corresponding to the
       nonterminal symbol Ai.

   2c) if f(u) = error, the parsing process can halt,
       reporting an illegal input string. More
       practically, some method of error recovery can
       be employed in order to 'retrack' the parser
       and continue with the parse.

   2d) if f(u) = accept, the parser halts; the input
       string (perhaps suitably modified by error
       recovery) has been accepted.
The parser has now completed a parsing step, and can
begin  again.  The  cycle  continues  until  either  the  input 
string  is  accepted  or  an  irrecoverable  error  occurs. 
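
The parsing step can be sketched as a short table-driven loop for k = 1. This is an illustration under stated assumptions, not the parser described in this report: the tables below were worked out by hand for the small augmented grammar (0) S' -> S, (1) S -> ( S ), (2) S -> x, and the names TABLES, PRODUCTIONS, END and parse are hypothetical.

```python
# Sketch only: a k = 1 parser driven by hand-built (f, g) tables.
END = "e"  # lookahead seen when the remaining input is empty

# production index -> (left hand side, length of right hand side)
PRODUCTIONS = {0: ("S'", 1), 1: ("S", 3), 2: ("S", 1)}

# Each table is a pair (f, g). f maps a lookahead symbol to an action and
# g maps a grammar symbol to the next table. Missing entries mean 'error'.
TABLES = [
    ({"(": "shift", "x": "shift"}, {"S": 1, "(": 2, "x": 3}),   # T0
    ({END: "accept"}, {}),                                      # T1
    ({"(": "shift", "x": "shift"}, {"S": 4, "(": 2, "x": 3}),   # T2
    ({")": ("reduce", 2), END: ("reduce", 2)}, {}),             # T3
    ({")": "shift"}, {")": 5}),                                 # T4
    ({")": ("reduce", 1), END: ("reduce", 1)}, {}),             # T5
]

def parse(tokens):
    """Return the canonical parse as a list of production indices, or None."""
    stack = [0]                     # stack of table indices, initially T0
    pos = 0
    output = []
    while True:
        u = tokens[pos] if pos < len(tokens) else END   # step 1
        f, g = TABLES[stack[-1]]
        action = f.get(u, "error")                      # step 2
        if action == "shift":                           # step 2a
            stack.append(g[u])
            pos += 1
        elif action == "accept":                        # step 2d
            return output
        elif action == "error":                         # step 2c, no recovery
            return None
        else:                                           # step 2b
            _, i = action
            lhs, length = PRODUCTIONS[i]
            del stack[len(stack) - length:]             # pop one table per symbol
            _, g_m = TABLES[stack[-1]]                  # Tm is now uppermost
            stack.append(g_m[lhs])
            output.append(i)
```

For the input ((x)) the sketch returns the production sequence 2, 1, 1, which is the reverse of the rightmost derivation S => (S) => ((S)) => ((x)); the reduction by production 0 is replaced by the accept action.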

Given a set of tables for a grammar, the LR parser
performs  an  efficient  parse  of  input  strings  presented  to 
it.  The  question  naturally  arises,  how  do  we  construct 
these  tables?  The  answer  lies  in  an  understanding  of  the 
information  contained  in  a  table. 


2.3: The Canonical Method of LR(k) Parse Table Construction

Each  parse  table  is  associated  with  an  equivalence  class 
of  partially  recognized  productions.  The  constructor 
enumerates  all  possible  parses,  by  first  constructing  these 
equivalence  classes,  and  then  deriving  a  table  from  each 
equivalence  class.  We  next  define  a  notation  for  more 
formally  dealing  with  partially  recognized  productions, 
called  items,  and  equivalence  classes,  called  item  sets. 

An LR(k) item is an ordered triple (p, j, u), where
u is a lookahead string
(u ∈ Vt*, and |u| ≤ k);
p is the index of some production in P; and
j is either the index of some symbol on the right
hand side of the pth production, or zero.

The  existence  of  an  item  (p,j,u)  in  an  item  set 
indicates  that,  if  the  parser  is  in  the  table  associated 
with  this  set  of  items,  then  the  parse  attempt  has  reached  a 
point  where  it  is  possible  that 

1)  production  p  has  been  used  to  generate  a  portion  of 
the  input  examined  so  far; 

2)  the  first  j  symbols  of  production  p  have  already  been 
recognized;  and 

3)  the  terminal  string  u  follows  the  string  generated  by 
production  p. 

In an attempt to make the meaning of an item more
intuitive, it is customary to denote (p,j,u) as
[A -> x•y,u]. Here A is the nonterminal defined by the pth
production, x = X1 X2 ... X(j), and y = X(j+1) ... X(#p).
The notation [A -> x•y,u] indicates that we may be in the
act of recognizing an instance of the nonterminal A in the
form xy, and have so far recognized part of the production,
namely x. We may validly see some string derivable from y,
followed by u, as the next portion of the input string. We
refer to the symbol X(j+1), the grammar symbol immediately
to the right of the 'dot', as the transition symbol of the
item. An item is termed final if y = e, initial if x = e
and y ≠ e, empty if x = y = e, and intermediate otherwise
[DeRemer 1974]. We should point out that by this definition
an empty item is also a final item.
The item set construction algorithm starts initially
with the single item set {[S' -> •S,e]} (i.e., {(0,0,e)}),
indicating the fact that we have so far recognized zero
symbols of the zeroth production, where by convention the
zeroth production is the one defining the goal symbol of the
augmented grammar.

The algorithm performs a completion operation on each
item  set  in  the  collection.  This  completion  operation  may 
generate  new  sets  of  items,  which  are  added  to  the 
collection  and  completed  in  turn.  The  algorithm  terminates 
when  all  item  sets  in  the  collection  have  been  completed. 
Intuitively,  the  completion  of  an  item  set  corresponds  to 
the  determination  of  (1)  each  symbol  the  parser  could  read 
while  using  the  table  associated  with  this  item  set,  and  (2) 
the  next  table  to  use  after  reading  that  particular  symbol. 
These  two  notions  are  embodied  in  the  completion  operation 
by  defining  completion  in  terms  of  the  two  operations 
closure  and  successor. 

Given an uncompleted item set B, often called the basis
set or nucleus, we recursively define C, the closure of B,
as the set

C = B union {[A -> •w,s] | [D -> x•Ay,u] ∈ C,
                           A ∈ Vn, and
                           s ∈ FIRST_k(yu)}.

Intuitively, if [D -> x•Ay,u] ∈ C, then, in a parse the next
symbol A could be about to be recognized. If this symbol is
a nonterminal symbol then we could be about to recognize the
first symbol in any of the productions defining A. We
therefore add [A -> •w,s] to C, for each production A -> w
defining A, and each lookahead string s. The lookahead
strings, as we mentioned above, are the strings we can
expect to see when we have finally recognized an instance of
A. Since we started the recognition attempt using the item
[D -> x•Ay,u], we would expect to see a string that has as
one of its prefixes some string derivable from yu; since we
are interested in at most the first k symbols of yu, we
choose all s such that s ∈ FIRST_k(yu).

The  closure  operation  must  terminate  because  we  can  add 
only  a  finite  number  of  items  to  C.  There  are  a  finite 
number  of  productions,  and  a  finite  number  of  lookahead 
strings.  The  number  of  items  added  by  the  closure  operation 
is  bounded  by  the  product  of  these  two. 
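
The closure operation for k = 1 can be sketched with a worklist, using items in the triple form (p, j, u). The example grammar (0) S' -> S, (1) S -> CC, (2) C -> cC, (3) C -> d is chosen for illustration only, and the names PRODS, FIRST and closure are assumptions of this sketch rather than part of the constructor.

```python
# Sketch only: closure of a set of LR(1) items (p, j, u).
EPS = "e"  # null string / empty lookahead; assumes no terminal is spelled "e"

# Augmented grammar; production 0 defines the goal symbol S'.
PRODS = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {lhs for lhs, _ in PRODS}

def first_of_string(z, first):
    result = set()
    for sym in z:
        if sym not in NONTERMINALS:     # terminal symbol
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def compute_first():
    first = {a: set() for a in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            new = first_of_string(rhs, first)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

FIRST = compute_first()

def closure(basis):
    """Add [A -> .w, s] for every [D -> x.Ay, u] in C and every s in FIRST(yu)."""
    c = set(basis)
    work = list(basis)
    while work:
        p, j, u = work.pop()
        rhs = PRODS[p][1]
        if j == len(rhs) or rhs[j] not in NONTERMINALS:
            continue                    # final item, or terminal transition symbol
        a = rhs[j]
        yu = rhs[j + 1:] + ((u,) if u != EPS else ())
        for s in first_of_string(yu, FIRST):
            for q, (lhs, _) in enumerate(PRODS):
                if lhs == a and (q, 0, s) not in c:
                    c.add((q, 0, s))
                    work.append((q, 0, s))
    return c

initial = closure({(0, 0, EPS)})
```

Here the closure of {(0,0,e)} contains six items: (0,0,e), (1,0,e), and the items (2,0,s) and (3,0,s) for s in {c, d}.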

We must next define the successor operation on C. Let
T(C,Z) be the set of items

T(C,Z) = {[D -> xZ•y,u] | [D -> x•Zy,u] ∈ C}

In other words, T(C,Z) generates the basis set associated
with the table that the parser should use next if it reads a
Z while using the current table. The 'T' in T(C,Z) is
derived from transition. The successor operation computes
T(C,Z) for all Z ∈ V (i.e., all grammar symbols), and adds
to the collection of item sets any nonempty set not already
present. As mentioned above, these new uncompleted item
sets will be completed in their turn.

Thus  the  item  set  construction  algorithm  iteratively 
performs  the  two  operations  of  closure  and  successor.  The 
algorithm  is  guaranteed  to  halt  because  there  are  only  a 
finite  number  of  sets  of  items,  and  each  can  be  added  to  the 
collection  at  most  once. 
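
Putting closure and successor together, the whole construction for k = 1 might look like the sketch below. It stores closed item sets rather than basis sets, which is a simplification made for this example; the grammar and all names are again illustrative assumptions.

```python
# Sketch only: canonical collection of LR(1) item sets via closure and successor.
EPS = "e"  # null string / empty lookahead; assumes no terminal is spelled "e"
PRODS = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {lhs for lhs, _ in PRODS}

def first_of_string(z, first):
    result = set()
    for sym in z:
        if sym not in NONTERMINALS:
            result.add(sym)
            return result
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result

def compute_first():
    first = {a: set() for a in NONTERMINALS}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in PRODS:
            new = first_of_string(rhs, first)
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

FIRST = compute_first()

def closure(basis):
    c, work = set(basis), list(basis)
    while work:
        p, j, u = work.pop()
        rhs = PRODS[p][1]
        if j == len(rhs) or rhs[j] not in NONTERMINALS:
            continue
        yu = rhs[j + 1:] + ((u,) if u != EPS else ())
        for s in first_of_string(yu, FIRST):
            for q, (lhs, _) in enumerate(PRODS):
                if lhs == rhs[j] and (q, 0, s) not in c:
                    c.add((q, 0, s))
                    work.append((q, 0, s))
    return c

def successor(c, z):
    """T(C, Z): move the dot over Z in every item whose transition symbol is Z."""
    return {(p, j + 1, u) for (p, j, u) in c
            if j < len(PRODS[p][1]) and PRODS[p][1][j] == z}

def item_sets():
    start = frozenset(closure({(0, 0, EPS)}))
    collection, todo, goto = [start], [start], {}
    symbols = sorted({sym for _, rhs in PRODS for sym in rhs})
    while todo:                          # complete each item set in turn
        c = todo.pop()
        for z in symbols:
            basis = successor(c, z)
            if not basis:
                continue
            nxt = frozenset(closure(basis))
            if nxt not in collection:    # each set is added at most once
                collection.append(nxt)
                todo.append(nxt)
            goto[(collection.index(c), z)] = collection.index(nxt)
    return collection, goto

collection, goto = item_sets()
```

For this four-production grammar the sketch produces ten item sets and thirteen transitions.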

Once we have computed the collection of sets of LR(k)
items,  it  is  rather  simple  to  derive  the  table  entries. 

Given a closed item set C, we compute the table (f,g)
associated with C as follows:

1) compute the parsing action function, f:

   a) for each [A -> x•y,u] ∈ C:

      1) if y ≠ e, then we have not yet recognized
         the production defining nonterminal A; we
         therefore set f(s) = shift, for all
         s ∈ e_free_FIRST_k(yu).

      2) if y = e, then we have completely
         recognized the production defining A; we
         set f(u) = reduce i, where A -> x is the
         ith production, unless A = S' and x = S, in
         which case we set f(u) = accept.

   b) for all s ∈ Vt* such that |s| ≤ k and f(s) has
      not been given a value by (a) above, we set
      f(s) = error.

2) compute the goto function, g: for each grammar
   symbol Z:

   a) if T(C,Z) is nonempty, set g(Z) equal to the
      name of the table associated with T(C,Z).

   b) if T(C,Z) is empty, set g(Z) = error.
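
Step 1 can be sketched for k = 1 as follows. The sketch assumes that, for a closed item set, the shift lookaheads given by e_free_FIRST(yu) are exactly the terminal transition symbols of the items in the set; this shortcut, the conflict check, and the names PRODS and action_function are assumptions of the example, not a description of the constructor.

```python
# Sketch only: the parsing action function f for one closed LR(1) item set.
EPS = "e"  # empty lookahead; assumes no terminal is spelled "e"
PRODS = [("S'", ("S",)), ("S", ("C", "C")), ("C", ("c", "C")), ("C", ("d",))]
NONTERMINALS = {lhs for lhs, _ in PRODS}

def action_function(c):
    """Map each lookahead to shift, accept or ('reduce', i); absent means error."""
    f = {}

    def set_action(s, act):
        # Two different actions for one lookahead mean the construction fails.
        if f.setdefault(s, act) != act:
            raise ValueError(f"conflict on {s!r}: {f[s]} and {act}")

    for p, j, u in c:
        rhs = PRODS[p][1]
        if j < len(rhs):                      # y is not e
            if rhs[j] not in NONTERMINALS:    # terminal transition symbol
                set_action(rhs[j], "shift")
        elif p == 0:                          # [S' -> S., u]
            set_action(u, "accept")
        else:                                 # final item of production p
            set_action(u, ("reduce", p))
    return f

# Closed item sets written out by hand for the grammar above.
I0 = {(0, 0, "e"), (1, 0, "e"), (2, 0, "c"), (2, 0, "d"), (3, 0, "c"), (3, 0, "d")}
I1 = {(0, 1, "e")}
I4 = {(3, 1, "c"), (3, 1, "d")}
```

With these sets, f for I0 shifts on c and d, f for I1 accepts on the empty lookahead, and f for I4 reduces by production 3 on c and d.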

It  is  not  obvious  that  the  algorithms  described  above 
are practically implementable. For example, although the
algorithm  computing  the  collection  of  item  sets  computes  a 
finite  number  of  item  sets,  it  may  generate  quite  a  large 
collection.  Anderson  [Anderson,  Eve  and  Horning  1973] 
reports  that  the  LR(1)  item  set  construction  algorithm,  when 
applied  to  a  grammar  for  ALGOL-60,  ran  out  of  memory  after 
the  creation  of  10,000  items  and  1237  item  sets.  However, 
as  we  mentioned  briefly  in  the  introduction,  several  table 
construction methods exist which make the LR parsing
algorithm  a  very  desirable  one.  We  will  consider  several  of 
these  methods  in  Section  4.3.  In  Appendix  F  we  present 
examples  of  the  above  algorithms  for  a  simple  grammar  with 
k  =  1. 
2.4: Semantic Actions


This  constructor  allows  semantic  actions  to  be  paired 
with  parsing  actions.  Thus  the  f  function  returns  an 
ordered  pair,  (parsing  action,  semantic  action  sequence), 
where  the  parsing  action  is  as  defined  above,  and  the 
semantic  action  sequence  is  defined  by  the  user  of  the 
constructor. 

Over the past years, several notations have been
developed for associating syntax and semantics. Most denote
syntax by some context-free grammar notation, notably BNF.
The semantics can be prose [Naur 1963], 'interpretation'
rules [Wirth and Weber 1966], source or object code [Johnson
1974], [Barnard 1975], proof rules [Ledgard 1974], or
'semantic action symbols' which stand in place of the
source/object code or proof rules. There are also other
methods, notably two-level grammars [Van Wijngaarden 1969],
[Baker 1972], [Koster 1974]. There are also versions of
attributed translation grammars, such as Koster's Affix
grammars [Koster 1971a]. A good survey of often-used
techniques is provided in [Marcotty, Ledgard and Bochmann
1976].

Confining  our  attention  to  single-level  context-free 
grammars  for  the  description  of  syntax,  we  find  two 
different  notational  methods  in  common  usage  for  the 
association  of  semantics  with  syntax,  based  on  the  parsing 
method to be used. Grammars destined for deterministic
top-down analysis (i.e., the LL(1) method) may have semantics at
any point of the right hand side of the production [Lewis
and Stearns 1968], [Koster 1971b], [Lewis, Rosenkrantz and
Stearns 1973], [Barnard 1975]. The implication is that when
the  semantics  are  'reached'  in  the  production  being 
recognized,  they  are  immediately  processed  by  the  parser. 

The  notation  is  different,  however,  for  those  grammars 
to  be  used  in  a  deterministic  bottom-up  parser.  In  this 
case,  semantics  are  typically  associated  only  with  the 
completion  of  the  production.  The  reason  is  obvious:  in 
order  to  ensure  correct  semantics,  we  cannot  act  upon  any 
semantics  associated  with  a  particular  production  until  that 
production  has  been  recognized.  And  bottom-up  parsers  do 
not  recognize  a  production  until  the  end  of  the  string 
produced  by  the  production  is  reached. 

However,  the  need  sometimes  arises  for  the  association 
of  semantics  with  a  point  other  than  the  completion  of  some 
production.  The  method  normally  used  in  achieving  this 
effect  involves  the  following  steps: 
1) create a new nonterminal symbol;

2)  define  it  to  be  null; 

3)  associate  the  semantics  with  the  completion  of  the 
empty  production;  and 

4)  place  a  reference  to  the  new  nonterminal  symbol  in 
the  original  production  at  the  point  where  the 
semantics  are  desired. 

Similarly,  one  may  split  the  production  at  the  point 
where  semantics  are  to  be  added.  Then  a  new  nonterminal 
symbol  is  defined  to  be  the  first  part  of  the  original 
production  and  given  the  new  semantic  actions.  The  first 
part  of  the  original  production  is  then  replaced  by  a 
reference  to  the  new  nonterminal  symbol.  Thus  we  have 
effectively  associated  semantics  with  some  point  in  a 
production  other  than  its  completion  and  still  retained  the 
necessary  grammatical  form,  at  the  cost  of  one  new 
nonterminal  symbol,  one  new  production,  and  some  degree  of 
grammar  readability. 

In  either  case,  the  cost  at  parse  time  is  also  increased 
over  and  above  the  cost  due  to  the  actual  processing  of  the 
newly-added  semantics;  the  parser  must  perform  an  extra 
parsing  action,  a  reduction  of  some  string  to  the  new 
nonterminal  symbol,  every  time  the  semantics  are  to  be  acted 
upon.  Obviously,  as  the  number  of  positions  at  which 
semantics  are  desired  increases,  our  costs  increase.  It 
must  also  be  noted  that  the  introduction  of  intra-production 
semantics  is  very  inconvenient. 
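The first workaround described above (a new nonterminal defined to be null) can be sketched as a mechanical grammar rewrite. The Python below is only an illustration: the representation of productions as (lhs, rhs) pairs, the '@' prefix marking action symbols, and the generated names _act1, _act2, ... are assumptions made for this example and are not part of the constructor.

```python
def add_marker_nonterminals(productions):
    """Move embedded actions onto the completion of empty 'marker' rules.

    productions: list of (lhs, rhs) pairs; rhs is a list of symbols.
    A symbol beginning with '@' is a semantic action (assumed convention).
    Returns (lhs, rhs, action) triples, where action is the semantic
    action tied to the completion of that production, or None.
    """
    rewritten, markers = [], []
    for lhs, rhs in productions:
        new_rhs, final_action = [], None
        for position, symbol in enumerate(rhs):
            if not symbol.startswith("@"):
                new_rhs.append(symbol)
            elif position == len(rhs) - 1:
                # An action at the end already sits at the completion point.
                final_action = symbol
            else:
                # Steps 1-4: new nonterminal, defined to be null, carrying
                # the action, referenced where the action originally stood.
                marker = f"_act{len(markers) + 1}"
                markers.append((marker, [], symbol))
                new_rhs.append(marker)
        rewritten.append((lhs, new_rhs, final_action))
    return rewritten + markers


example = [("block", ["begin", "@indent", "stmt", "@exdent", "end"])]
for rule in add_marker_nonterminals(example):
    print(rule)
```

Each embedded action costs one extra nonterminal and one extra empty production, and hence one extra reduction at parse time, which is the overhead noted in the text.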

Our  goal  in  this  system  is  to  provide  a  convenient 
method  of  associating  semantics  with  any  point  of  a 
production,  at  uniform  cost,  regardless  of  the  position  of 
association. We also wish to avoid extra grammatical
requirements, such as new nonterminals and productions,
since  these  detract  from  the  ease  of  use  and  readability  of 
the  grammar.  We  have  therefore  chosen  the  notation  commonly 
used  in  top-down  parsing  systems,  namely  placing  a  semantic 
action  symbol  at  the  point  desired  in  the  production.  The 
parser  then  outputs  the  symbol  during  the  parse  of  a  string 
each  and  every  time  that  point  of  that  production  is 
reached. We demonstrate the notation by defining a program
block in the SUE.8 language, using the semantic actions
indent, exdent, and new_line, which control movement of the
left margin and logical line processing, and are to be acted
upon by the source code paragrapher provided as part of the
parsing system:

program_block -> 'begin' indent new_line
                 ( statement new_line )+
                 exdent 'end' ';'

The parser 'performs' semantic actions as follows: when
the parse attempt of an input string reaches a point in a
production which has associated semantics, the action
symbols are processed by a routine called the parse output
processor. This routine can actually execute code unique to
each particular action symbol or enqueue the symbol for
later  processing  by  a  semantic  analyzer  or  code  generator, 
depending  on  the  parsing  environment  and  the  particular 
semantic  action  symbol.  We  will  discuss  this  further  in 
Sections  3.3  and  3.4. 

Thus  the  grammars  we  process  can  be  characterized  as  a 
5-tuple  G  =  <Vn,  Vt,  Va,  S,  P>,  where 

Vn,  Vt,  and  S  are  as  defined  for  context-free  grammars; 

Va  is  a  finite  set  of  action  symbols,  disjoint  from  Vn 
and  Vt;  and 

P is a set of productions of the form A -> z, where
A ∈ Vn and z ∈ (Vn union Vt union Va)*.

We  will  treat  semantic  action  symbols  as  we  would  other 
symbols;  to  each  semantic  action  sequence  in  a  production, 
we  associate  a  lookahead  string  to  determine  the 
applicability  of  the  sequence. 

Assume that we are constructing an LR(1) parser. If an
action sequence is to the left of a terminal symbol, that
terminal is the lookahead symbol for the sequence. If an
action sequence is to the left of a nonterminal symbol A,
then some subset of (FIRST(A) union FOLLOW(A)) is the set of
lookahead strings for the sequence. For action sequences
that  are  not  to  the  left  of  other  grammar  symbols  in  the 
containing  production,  the  set  of  lookahead  strings  of  the 
action  sequence  is  some  subset  of  that  set  of  strings  that 
may  follow  the  nonterminal  symbol  defined  by  the  production. 
As  it  turns  out,  the  lookahead  strings  for  each  of  these 
three  cases  are  already  computed  by  the  item  set 
construction  algorithm,  so  very  little  extra  work  is 
required  to  allow  semantic  actions  in  the  canonical  model. 
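The three cases above can be illustrated for k = 1 with a short sketch. It computes only the bounding set named in the text for each action sequence; the actual lookahead set is some subset of it, determined by the item set construction. The function name, the data layout, and the assumption that the FIRST sets passed in contain only terminal symbols are ours.

```python
def action_lookahead_bounds(lhs, rhs, terminals, actions, first, follow):
    """Return [(action_sequence, bounding_lookahead_set)] for one production.

    first / follow: dicts mapping a nonterminal to a set of terminals
    (assumed to be precomputed elsewhere).
    """
    bounds = []
    i = 0
    while i < len(rhs):
        if rhs[i] not in actions:
            i += 1
            continue
        start = i
        while i < len(rhs) and rhs[i] in actions:
            i += 1                      # gather a maximal run of actions
        sequence = tuple(rhs[start:i])
        if i == len(rhs):
            # Case 3: nothing to the right; bounded by FOLLOW of the lhs.
            bound = set(follow[lhs])
        elif rhs[i] in terminals:
            # Case 1: immediately left of a terminal.
            bound = {rhs[i]}
        else:
            # Case 2: immediately left of a nonterminal A.
            bound = set(first[rhs[i]]) | set(follow[rhs[i]])
        bounds.append((sequence, bound))
    return bounds


# Hypothetical production A -> I a J B K with action symbols I, J, K.
result = action_lookahead_bounds(
    "A", ["I", "a", "J", "B", "K"],
    terminals={"a", "b"}, actions={"I", "J", "K"},
    first={"B": {"b"}}, follow={"A": {"$"}, "B": {"$"}},
)
for sequence, bound in result:
    print(sequence, sorted(bound))
```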

The  first  two  cases  of  lookahead  strings  we  discussed 
above  help  motivate  the  methods  we  will  describe  in  the 
implementation  of  a  constructor  that  allows  embedded 
semantic  action  symbols.  These  methods  are  called  left 
association  and  propagation.  We  will  discuss  the  use  of 
these  methods  in  Section  4.4. 


2.4.1:  Syntax-Directed  Translation  Schemata 

We  next  relate  our  system  to  syntax-directed  translators 
and  their  schemata.  Most  of  our  terminology  is  taken  from 
[Aho and Ullman 1973a] and [Lewis and Stearns 1968].

A syntax-directed translation schema, or SDTS, is a
5-tuple T = <Vn, Vi, Vo, S, R>, where

Vn is a finite set of nonterminal symbols;

Vi is a finite input alphabet;

Vo is a finite output alphabet;

S ∈ Vn is the start symbol; and

R is a finite set of productions of the form A -> z {z'}
where A ∈ Vn, z ∈ (Vn union Vi)*, z' ∈ (Vn union Vo)*,
and the nonterminals in z' are a permutation of those
in z: to each nonterminal symbol of z there is
associated an identical nonterminal symbol of z'.

In the production A -> z {z'}, z' is called the
translation element of the production.

The grammar Gi = <Vn, Vi, S, Pi> formed from T with
Pi = {A -> z | A -> z {z'} ∈ R} is called the underlying
grammar or the input grammar. Similarly,
Go = <Vn, Vo, S, Po> with Po = {A -> z' | A -> z {z'} ∈ R}
is called the output grammar.

If in every production A -> z {z'} of R, associated
nonterminal symbols occur in the same order in both z and
z', then T is termed a simple SDTS.
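The definitions above translate directly into a few lines of code. In this sketch a rule A -> z {z'} is held as a triple (A, z, z'), and the association between nonterminals of z and z' is assumed to be positional (the i-th nonterminal of z is associated with the i-th nonterminal of z'), which is adequate for testing simplicity; all names are ours.

```python
def nonterminal_order(symbols, nonterminals):
    """The nonterminal symbols of a string, in order of appearance."""
    return [s for s in symbols if s in nonterminals]


def is_simple_sdts(rules, nonterminals):
    """True if nonterminals occur in the same order in z and z' in every rule."""
    return all(
        nonterminal_order(z, nonterminals) == nonterminal_order(z_out, nonterminals)
        for _, z, z_out in rules
    )


def input_grammar(rules):
    """Productions Pi = {A -> z} of the underlying (input) grammar."""
    return [(a, z) for a, z, _ in rules]


def output_grammar(rules):
    """Productions Po = {A -> z'} of the output grammar."""
    return [(a, z_out) for a, _, z_out in rules]


# Hypothetical schemata: infix-to-postfix is simple, a swap is not.
postfix = [("E", ["E", "+", "T"], ["E", "T", "add"])]
swap = [("S", ["A", "B"], ["B", "A"])]
print(is_simple_sdts(postfix, {"E", "T"}))
print(is_simple_sdts(swap, {"S", "A", "B"}))
```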

Note  the  great  similarity  between  our  system  and  a 
simple  SDTS:  intuitively,  an  SDTS  is  a  grammar  with 
information  attached  to  each  production,  so  that  whenever 
the  production  is  used  in  deriving  a  terminal  string  of  the 
grammar,  the  production's  translation  element  is  used  in 
determining  part  of  the  output  sequence  associated  with  that 
part  of  the  input  string  generated  by  the  production. 

The  translation  element  associated  with  a  particular 
production  normally  comes  into  play  once  the  production  has 
been  used  in  a  derivation.  In  the  recognition  attempt  of  an 
input  string  by  an  LL(1)  parser  the  output  sentential  forms 
can  be  created  while  the  production  is  being  used  in  a 
derivation:  there  can  be  no  question  about  which  production 
is  being  applied,  and  therefore  no  necessity  to  wait  until 
the  whole  production  is  completed  before  generating  output. 

We have a problem with LR(k) parsers, however, in that
the  parser  does  not  know  if  a  particular  production  has  been 
used  in  the  derivation  of  a  sentence  until  k  symbols  after 
the  last  symbol  derived  from  the  production.  This  motivates 
the  normal  restriction  of  association  of  semantics  with 
syntax  in  grammars  destined  for  bottom-up  recognition. 

We  avoid  this  problem  by  using  the  following  simple 
rule.  Suppose  the  parser  is  currently  using  a  table  whose 
associated  item  set  contains  several  items.  (There  is  no 
problem  in  associating  semantics  if  there  is  only  one  item 
in  an  item  set.)  Then  the  parser  is  simultaneously 
'recognizing'  several  productions.  If  several  of  these 
items  have  any  lookahead  strings  in  common,  then  we  simply 
require  that  identical  semantic  actions  be  associated  with 
the  transition  symbols  of  these  items.  If  the  semantic 
actions are not identical, then we have discovered a
semantic  action  conflict. 
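The rule can be sketched as a pairwise check over the items of one item set. The model here is deliberately simplified and is our assumption: each item is reduced to a label, its set of lookahead symbols, and the semantic action sequence attached to its transition symbol.

```python
from itertools import combinations


def semantic_action_conflicts(items):
    """Report pairs of items that share a lookahead but differ in actions.

    items: list of (label, lookaheads, actions) where lookaheads is a
    frozenset of symbols and actions is a tuple of action names.
    """
    conflicts = []
    for (label_a, la_a, act_a), (label_b, la_b, act_b) in combinations(items, 2):
        shared = la_a & la_b
        if shared and act_a != act_b:
            conflicts.append((label_a, label_b, shared))
    return conflicts


# Hypothetical item set: the first two items both expect 'b' next but
# ask for different actions, so they are in conflict.
item_set = [
    ("A -> a . I b", frozenset({"b"}), ("I",)),
    ("B -> a . J b", frozenset({"b"}), ("J",)),
    ("C -> a . c", frozenset({"c"}), ()),
]
print(semantic_action_conflicts(item_set))
```

An item set with a single item can never produce a conflict under this check, in agreement with the parenthetical remark above.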


2.4.2:  Comparing  Notations 

Let us compare normal SDTS notation with the notation
commonly used in top-down systems, to which we subscribe.

We will examine the case of placing paragraphing action
symbols in the grammar. We wish to coordinate the output of
these symbols (and therefore the performance of the actions)
with the recognition of the terminal symbols. The normal
notation for simple transduction elements, e.g. A -> abBcd
{IBJ}, only specifies the string of output symbols that
precede or follow nonterminal symbols. In the normal view,
the timing with terminal symbols of the underlying input
grammar is immaterial, because output occurs after
recognition. The intermeshing of input and output grammars
in our system gives us the desired control over the timing.
We could achieve the same effect in the normal notation by
repeating ourselves, if the terminal symbols of the input
grammar are contained in the output alphabet: A -> abBcd
{aIbBcdJ}. This explicitly states that a and b are output,
with I between them; I could be the action new_line, or
something similar. In one use of our system, we may assume
that terminal symbols of the input grammar are contained in
the output alphabet, and are to be output; then we can
specify additional actions to control paragraphing, and not
be forced to repeat ourselves: A -> aIbBcdJ.

The  implication  of  the  mixed  notation  is  that  we  have 
interwoven  the  processes  of  grammatical  analysis  and 
translation  output;  the  normal  SDTS  notation  does  not  imply 
this.  Thus  the  class  of  grammars  we  can  recognize  plays  a 
large  role  in  the  determination  of  the  translations  we  can 
perform.  We  will  have  more  to  say  about  this  in  Section 
4.4. 
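The relationship between the mixed notation and the normal SDTS notation can be made concrete with a small sketch that recovers z and z' from a mixed right hand side. The function and its echo_terminals flag are illustrative names of ours; the flag selects between the two readings discussed above (terminals contained in the output alphabet, or not).

```python
def split_mixed(rhs, actions, terminals, echo_terminals=True):
    """Split a mixed right hand side into (z, z') of an SDTS rule.

    z  : the input string, i.e. the rhs with action symbols removed.
    z' : the translation element; with echo_terminals the terminals are
         part of the output, otherwise only actions and nonterminals are.
    """
    z = [s for s in rhs if s not in actions]
    if echo_terminals:
        z_out = list(rhs)
    else:
        z_out = [s for s in rhs if s not in terminals]
    return z, z_out


mixed = list("aIbBcdJ")            # the production A -> aIbBcdJ
z, z_out = split_mixed(mixed, actions={"I", "J"}, terminals=set("abcd"))
print("".join(z), "".join(z_out))
```

With echo_terminals=False the same call yields the translation element IBJ of the first example in the text.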
Chapter 3

The  Design  of  the  Parsing  System 


3.1:  Overview 


The parsing system consists of the following parts:


1)  A  scanner; 

2)  A  screener; 

3)  A  parser; 

4)  A  parse  output  processor;  and 

5)  A  source  code  paragrapher. 


Our parsing system structure is modelled after that of
the SP/k system [Holt and Wortman 1975], [Hume and Holt
1975], [Barnard 1975], with a more explicit interface
between the scanner and parser.

The  partitioning  of  the  parsing  system  is  very  much  a 
functional  decomposition.  The  interfaces  between  modules 
are  intended  to  be  as  small  as  possible  to  allow  the  user  of 
the  constructor  a  great  amount  of  freedom  to  modify  the 
parsing  system.  Quite  often,  every  new  language  to  be 
parsed  will  require  some  modifications  to  the  scanner  in 
order  to  recognize  a  new  class  of  language  tokens.  We  have 
attempted  to  separate  the  ideas  and  actions  of  the  source 
token  recognition  process  from  the  processing  of  parse 
tokens  by  utilizing  a  screener  between  the  scanner  and 
parser  [DeRemer  1974].  We  discuss  the  effects  of  this 
scheme  below. 


We  also  attempt  to  separate  as  much  as  possible  the 
actions  involved  in  parsing  and  those  involved  in  processing 
information  to  be  used  by  later  compiler  phases.  This 
separation  reflects  the  fact  that  the  parsing  process  is 
largely  independent  of  the  parser  environment,  whereas  the 
processing of semantic actions, language tokens, etc., is
very dependent on the language being recognized and on the
environment of the system. This separation is effected by
using a separate module, the parse output processor, to
perform  semantic  actions.  Thus  the  parser  itself  knows  very 
little  about  semantic  actions  —  it  simply  invokes  the  parse 
output  processor,  which  is  very  much  a  post-screener. 

The  scanner,  screener,  and  parse  output  processor  would 
be  typical  routines  requiring  various  amounts  of 
modification  (preferably  small  amounts)  for  each  different 
environment,  or  language  being  recognized.  For  example,  a 
new  language  may  require  the  addition  of  code  to  the  scanner 
in  order  to  recognize  classes  of  language  tokens  not 
previously  recognized.  And  each  new  set  of  tables  may 
contain different semantic actions, the definitions of which
must  appear  in  the  parse  output  processor. 


Figure 3.1.1 shows the hierarchical structure of the
parsing system. The 'ultimate ancestor' of the parsing
system is OVERSEER; all other routines are descendants of
the parser. OVERSEER initializes the parse tables and
essentially receives all the information from the
constructor which is to be used by the parsing system.


Figure 3.1.2 shows data communication paths among the
modules of the parsing system, as well as between the
parsing system and the constructor. Each path is discussed
in the appropriate section below.
Figure 3.1.1

A Hierarchical View
of the Parsing System

Figure 3.1.2

The  Data  Communication  Paths 
of  the  Modules 

Key: 

(a): language token to parse token mapping function

(b): parse tables, semantic action table

(c): semantic action codes

(1): semantic actions

(2): paragraphing actions

(3): parse tokens

(4): characters to output stream

(5): comments, language tokens

(6): output to later compiler phases

(7): language tokens

(8): characters from input stream


3.2:  The  Scanner 

The scanner is responsible for the distillation of
language tokens from the input stream. These language
tokens include such things as identifiers, numeric
constants, and quoted strings. The scanner is simply a
finite state machine. By examining the first character of
the input string, the scanner differentiates tokens into
their respective classes, for all classes except
multi-character and single-character special symbols. The scanner
returns to the screener both the string of characters
comprising the language token just read, and a token code
denoting the class to which the token belongs.

The  intent  of  the  scanner  design  is  to  make  the  scanner 
as  modifiable  as  possible  and  at  the  same  time  to  make 
modification  unnecessary  in  most  cases.  This  is  achieved  to 
some  degree  by  defining  the  purpose  of  the  scanner  very 
narrowly:  it  simply  recognizes  source  language  tokens  and 
returns a language token code to the screener. The screener
must  then  map  this  code  into  the  parse  token  code  expected 
by  the  parser. 

The  implemented  version  of  the  scanner  recognizes  the 
following  classes  of  language  tokens: 

a) identifiers: identifiers are tokens that begin with a
letter, or one of the characters '$', '#', or '@'.

The  remaining  characters  of  the  identifier  may  be 
letters,  digits,  the  above  three  characters,  or 

b) numeric constants: constants are a sequence of
digits. No decimal points are included.

c) multi-character tokens: these include such tokens as
the becomes symbol ':=' and the relational operator
'<='.

d) quoted strings: arbitrary text surrounded by single
quotes (enclosed single quotes must be written as two
single quotes).

e) comments: text surrounded by '/*' and '*/'; the text
may not contain '*/'.

f) special characters: '*', '/', etc.

The  user  of  the  constructor  must  provide  the  code 
segment  implementing  the  recognition  of  any  other  token 
classes. For example, the SUE.8 scanner recognizes the
multi-character  special  symbol  and  bit  constants, 

such as '"(4)F347"'. Such code segments are quite easy to
add  to  the  scanner. 
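The token classes listed above can be recognized by a scanner of the following shape. This is a sketch in Python rather than the implemented SUE.8 scanner. It assumes the three extra identifier characters are '$', '#' and '@' (the source text is damaged at that point), omits the further identifier character that is illegible in the source, and knows only the two multi-character tokens named in the text. Class names such as "identifier" are ours.

```python
import string

IDENT_START = set(string.ascii_letters) | set("$#@")   # '@' is an assumption
IDENT_REST = IDENT_START | set(string.digits)
MULTI_CHAR = (":=", "<=")


def next_token(text, pos):
    """Return (token_class, token_text, new_pos) for the token at pos."""
    ch = text[pos]
    if ch in IDENT_START:
        end = pos + 1
        while end < len(text) and text[end] in IDENT_REST:
            end += 1
        return "identifier", text[pos:end], end
    if ch in string.digits:
        end = pos + 1
        while end < len(text) and text[end] in string.digits:
            end += 1
        return "numeric_constant", text[pos:end], end
    if ch == "'":
        end = pos + 1
        while end < len(text):
            if text[end] == "'":
                if text[end + 1:end + 2] == "'":    # doubled quote inside string
                    end += 2
                    continue
                return "quoted_string", text[pos:end + 1], end + 1
            end += 1
        raise ValueError("unterminated quoted string")
    if text.startswith("/*", pos):
        close = text.find("*/", pos + 2)
        if close < 0:
            raise ValueError("unterminated comment")
        return "comment", text[pos:close + 2], close + 2
    for symbol in MULTI_CHAR:
        if text.startswith(symbol, pos):
            return "multi_character", symbol, pos + len(symbol)
    return "special_character", ch, pos + 1


def scan(text):
    """Split text into (token_class, token_text) pairs, skipping blanks."""
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        token_class, token_text, pos = next_token(text, pos)
        tokens.append((token_class, token_text))
    return tokens


print(scan("x1 := 'it''s' /* c */ <= 42;"))
```

Adding a token class amounts to adding one more branch keyed on the first character, which reflects the kind of small, local modification the text describes.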


3.3:  The  Screener 

The  screener,  following  [DeRemer  1974],  is  the  interface 
between  the  parser  and  the  scanner.  The  parser,  when 
calling  for  the  next  input  symbol,  expects  to  receive  the 
code  for  a  particular  terminal  symbol  of  the  particular 
grammar  represented  by  the  parse  tables.  The  scanner 
language  token  code,  on  the  other  hand,  is  not  dependent  on 
the  particular  parse  tables  in  use.  The  screener  maps  the 
language  token  codes  into  terminal  symbol  codes,  and  enters 
all  identifiers  via  a  hash  technique  into  an  identifier 
table. We have chosen to separate the names of identifiers
from  their  attributes,  as  is  done  in  the  SP/k 
implementation.  Thus  in  a  compiler  environment  we  have  two 
separate  tables,  the  identifier  table  and  the  symbol  table. 
These  two  tables  would  not  necessarily  be  contained  in  the 
same compiler phase. In both the SP/k and SUE.8
implementations,  for  example,  the  identifier  table  is 
controlled  by  the  screener,  whereas  the  symbol  table  is 
controlled  by  the  semantic  analysis  phases  which  follow  the 
parsing  phase. 
The screener also intercepts comments and directs them
to  the  paragrapher.  The  mapping  of  language  tokens  into 
parse  tokens  is  performed  by  utilizing  a  string  table,  which 
contains  the  keywords  and  special  characters  valid  for  the 
particular  language  represented  by  the  parse  tables.  This 
string  table  is  produced  for  the  screener  by  the 
constructor. 
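The screener's duties can be summarized in a short sketch. The class below is illustrative only: the numeric codes, the method name screen, and the use of a Python dict as the hashed identifier table are our assumptions. The string table stands for the table produced by the constructor.

```python
class Screener:
    """Maps language tokens (from the scanner) to parse tokens (for the parser)."""

    def __init__(self, string_table, class_codes):
        self.string_table = string_table    # keyword/special symbol -> parse token code
        self.class_codes = class_codes      # language token class -> parse token code
        self.identifier_table = {}          # identifier name -> index (hashed)
        self.comments = []                  # comments diverted to the paragrapher

    def screen(self, token_class, text):
        """Return (parse_token_code, value), or None for an intercepted comment."""
        if token_class == "comment":
            self.comments.append(text)
            return None
        if text in self.string_table:
            # Keywords and special symbols are language-specific terminals.
            return self.string_table[text], None
        if token_class == "identifier":
            index = self.identifier_table.setdefault(text, len(self.identifier_table))
            return self.class_codes["identifier"], index
        return self.class_codes[token_class], text


screener = Screener({"begin": 1, ":=": 2}, {"identifier": 10, "numeric_constant": 11})
print(screener.screen("identifier", "begin"))
print(screener.screen("identifier", "total"))
```

Only names are stored in the identifier table; attributes would live in a separate symbol table owned by a later phase, as described above.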


3.4:  The  Parser 

Due  to  the  table-driven  parsing  algorithm,  the  parser  is 
a  simple  routine  which  performs  parsing  steps  as  defined  in 
Section  2.2.  The  implemented  version  of  the  parser  contains 
a segment of code quite similar to the following SUE.8 code:

cycle
   Stack_current_table;
   Determine_parsing_action (Current_input_symbol,
                             Current_table);
   case Possible_parsing_actions tag Parsing_action;
      Shift:  Perform_any_semantic_actions;
              Current_table :=
                 Next_table (Current_input_symbol,
                             Current_table);
              Current_input_symbol := Next_input_symbol;
      Reduce: Perform_any_semantic_actions;
              Reduce_by_rule (Rule#);
              Determine_table_after_reduction;
      Accept: Perform_any_semantic_actions;
              exit;
      Error:  Attempt_error_recovery;
              exit unless Successful_error_recovery;
   end;
end;


Most  lines  of  the  above  code  segment  are  macro 
invocations  to  access  the  data  structures  containing  the 
parsing  tables.  The  data  structures  actually  used  by  the 
parser  are  somewhat  different  than  the  tables  discussed  in 
Section 2.3. We have employed a list representation of the
sparse  tables  and  intend  to  perform  various  other 
optimizations  on  the  lists  in  order  to  compact  them  as  much 
as  possible.  Our  main  loop  would  be  identical,  however,  if 
we  had  directly  used  the  tables  in  matrix  form;  we  need  only 
modify  the  bodies  of  the  few  macros  that  access  the  data 
structures. 
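For readers who wish to experiment, the same loop can be written as a small runnable sketch. The tables below were built by hand for the toy grammar L -> L ',' new_line id | id and are held as plain dictionaries; this layout, the rule numbering, and the action names new_line and done are assumptions made for the example, not the list representation used by the implemented parser. Error recovery is reduced to giving up.

```python
# ACTION[(table, input symbol)] = (parsing action, argument, semantic actions)
ACTION = {
    (0, "id"): ("shift", 1, []),
    (1, ","): ("reduce", 2, []),
    (1, "$"): ("reduce", 2, []),
    (2, ","): ("shift", 3, []),
    (2, "$"): ("accept", None, ["done"]),
    (3, "id"): ("shift", 4, ["new_line"]),
    (4, ","): ("reduce", 1, []),
    (4, "$"): ("reduce", 1, []),
}
GOTO = {(0, "L"): 2}
RULES = {1: ("L", 3), 2: ("L", 1)}   # rule number -> (lhs, length of rhs)


def parse(tokens, output):
    """Parse a list of tokens, appending performed semantic actions to output."""
    stack = []
    table = 0
    symbols = iter(tokens + ["$"])
    current = next(symbols)
    while True:                                           # cycle
        stack.append(table)                               # Stack_current_table
        kind, arg, actions = ACTION.get((table, current), ("error", None, []))
        if kind == "shift":
            output.extend(actions)                        # Perform_any_semantic_actions
            table = arg                                   # Next_table
            current = next(symbols)                       # Next_input_symbol
        elif kind == "reduce":
            output.extend(actions)
            lhs, length = RULES[arg]                      # Reduce_by_rule
            if length:
                del stack[-length:]                       # pop one table per rhs symbol
            table = GOTO[(stack[-1], lhs)]                # table after reduction
        elif kind == "accept":
            output.extend(actions)
            return True
        else:
            return False                                  # no error recovery here


performed = []
print(parse(["id", ",", "id"], performed), performed)
```

Replacing the dictionaries with another table representation changes only the three lookups, which parallels the remark above about modifying only the bodies of the access macros.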


3.5:  The  Parse  Output  Processor 

The  parse  output  processor  is  responsible  for 
interpreting  the  semantic  action  codes  'output'  by  the 
parser.  It  is  a  routine  requiring  some  amount  of 
modification  for  each  different  compiler  environment  and/or 
set of semantic actions. Each time the parser is to
'perform' a semantic action, it invokes the parse output
processor  with  the  name  of  the  semantic  action  as  a 
parameter.  Quite  often,  the  semantic  action  is  really  a 
pseudo-code  sequence  that  is  to  be  passed  on  to  the 
remaining  phases  of  the  compiler,  rather  than  an  action  to 
be  performed  at  parse  time.  The  parse  output  processor  can 
often  be  just  a  short  segment  of  code  that  places  its 
parameter  in  an  array,  or  writes  its  parameter  to  disk,  or 
simply  returns  control  to  a  different  compiler  phase, 
depending  on  the  method  of  passing  information  between 
compiler  phases  and  the  hierarchical  setup  of  the  compiler. 
The  implemented  version  of  this  system  separates 
paragraphing  actions  (as  defined  below)  from  other  semantic 
actions;  the  former  are  used  to  control  the  paragrapher,  and 
the  latter  are  passed  on  to  subsequent  compiler  phases. 
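A parse output processor of the kind just described can be very small. The sketch below separates the four basic paragraphing actions from all other semantic actions; the two destination lists stand in for the paragrapher and for whatever channel leads to the later compiler phases, and the function names are ours.

```python
PARAGRAPHING_ACTIONS = {"new_page", "new_line", "indent", "exdent"}


def make_parse_output_processor(paragrapher_queue, later_phase_queue):
    """Build the routine the parser invokes with each semantic action name."""
    def process(action_name):
        if action_name in PARAGRAPHING_ACTIONS:
            paragrapher_queue.append(action_name)    # controls the paragrapher
        else:
            later_phase_queue.append(action_name)    # passed on unchanged
    return process


to_paragrapher, to_later_phases = [], []
process = make_parse_output_processor(to_paragrapher, to_later_phases)
for name in ["indent", "declare_variable", "new_line", "emit_store"]:
    process(name)
print(to_paragrapher, to_later_phases)
```

The action names declare_variable and emit_store are hypothetical; in practice they would be whatever symbols appear in the user's grammar.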


3.6:  The  Source  Code  Paragrapher 

The  source  code  paragrapher  provides  an  automatic 
paragraphing  facility.  The  particular  paragraphing  format 
employed  is  left  up  to  the  user  of  the  constructor.  In 
fact,  the  constructor  has  no  knowledge  of  paragraphing 
actions  in  and  of  themselves;  they  are  simply  semantic 
actions.  The  constructor  user  can  provide  a  wide  range  of 
special  purpose  paragraphing  actions.  The  parsing  system 
described here provides the basic actions new_page,
new_line,  indent,  and  exdent. 

New_page  simply  causes  the  paragrapher  to  print  the 
current  logical  line,  and  then  to  eject  the  printer  to  the 
top  of  the  next  page. 

Indent  and  exdent  can  be  simply  defined  in  terms  of  an 
indentation  value,  which  is  just  the  number  of  columns  to  be 
skipped over when starting a new line. Neither indent nor
exdent causes any carriage motion when it is performed.
Indent  increments  the  indentation  value,  and  exdent 
decrements  this  value.  Depending  on  the  desired  appearance 
of  the  output  listing,  the  increment/decrement  value  can  be 
either  a  simple  constant  or  a  complex  function  partially 
dependent  on  the  current  indentation  value.  The  current 
implementation  of  the  actions  uses  a  constant  increment 
value  of  3  columns. 

New_line causes the paragrapher to finish the current
logical line. A flag is also set to cause indentation just
before the printing of the next symbol. This causes the
action sequences "new_line indent" and "indent new_line" to
have identical effects.
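The behaviour of indent, exdent and new_line described above can be sketched as follows. The class is a simplification of ours: it omits new_page and line overflow, separates symbols by a single blank, and guards the indentation value against going negative, none of which is specified by the text. The increment of 3 columns is the constant mentioned above.

```python
class Paragrapher:
    def __init__(self, step=3):
        self.step = step              # constant increment/decrement, in columns
        self.indentation = 0          # columns skipped when a new line starts
        self.line = ""                # current logical line
        self.lines = []               # finished lines
        self.indent_pending = True    # set by new_line, used by the next symbol

    def indent(self):
        self.indentation += self.step                  # no carriage motion

    def exdent(self):
        self.indentation = max(0, self.indentation - self.step)   # guard is ours

    def new_line(self):
        self.lines.append(self.line)
        self.line = ""
        self.indent_pending = True

    def put(self, symbol):
        if self.indent_pending:
            # Indentation is applied only now, just before the next symbol.
            self.line = " " * self.indentation + symbol
            self.indent_pending = False
        else:
            self.line += " " + symbol


def render(events):
    """Feed a mixed stream of actions and symbols to a fresh paragrapher."""
    paragrapher = Paragrapher()
    for event in events:
        if event in ("indent", "exdent", "new_line"):
            getattr(paragrapher, event)()
        else:
            paragrapher.put(event)
    paragrapher.new_line()
    return paragrapher.lines


print(render(["begin", "indent", "new_line", "x", ":=", "1", ";",
              "new_line", "exdent", "end", ";"]))
```

Because the indentation value is read only when the next symbol is printed, render(["a", "new_line", "indent", "b"]) and render(["a", "indent", "new_line", "b"]) give the same lines.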

The paragrapher's line width is variable, and the
paragrapher automatically allows for line overflow. This
provides for such facilities as 'listing' to a card punch
(thus providing a paragraphed source deck), or listing
source code that is proof-ready for publication.
For a more in-depth discussion of the pros and cons of
automatic paragraphing of source text, we refer the reader
to [Gordon 1975] and [Barnard 1975]. We have chosen to
include  a  paragraphing  lister  in  our  system  for  the  extra 
reliability  and  readability  that  a  consistent  paragraphing 
method  seems  to  provide. 
Chapter 4

The Design of an Efficient LALR Parse Table Constructor


An  Overview  of  the  Constructor 

In  this  chapter,  we  describe  the  design  of  a  practical 
parse  table  constructor.  We  avoid  the  issues  involved  in 
storage  structures  and  concentrate  instead  on  a  higher  level 
view.


We  begin  by  recalling  one  of  our  goals:  a  parse  table 
constructor  should  be  simple  to  use.  There  is  a  parallel 
between  the  development  of  parser  constructors  and  that  of 
any fairly large system: too often, human engineering is
left  for  later  consideration.  One  way  in  which  our  system 
caters  to  its  users  is  by  allowing  a  variety  of  grammar 
input  formats.  This  is  accomplished  by  using  the  parsing 
system  constructed  by  the  constructor  as  the  input  phase  for 
the  constructor.  By  substituting  a  different  set  of  parse 
tables,  a  different  grammar  format  can  be  employed.  The 
initial  system  must  be  bootstrapped. 

One  of  the  input  formats  provided  by  our  system  allows 
regular  expressions  on  the  right  hand  side  of  productions, 
which  we  feel  to  be  a  particularly  useful  addition  to 
constructors  due  to  the  added  convenience  and  readability 
often  afforded  by  such  a  format.  There  are  generally  fewer 
productions  and  nonterminal  symbols  in  a  grammar  that 
includes  regular  expressions  than  there  would  be  in  an 
equivalent  grammar  written  without  regular  expressions.  We 
discuss  the  grammar  input  and  transformation  process  in 
Section  4.1. 

A  system  should  also  do  more  than  simply  construct  parse 
tables  from  grammars,  as  mentioned  in  the  introduction.  The 
constructor  system  should  give  the  user  as  much  useful 
information  as  it  can  [Horning  1974b],  [Cohen  1975].  Our 
system  performs  an  analysis  of  the  grammar,  ensures  the 
grammar  is  in  reduced  form,  and  tests  for  simple  cases  of 
ambiguity.  We  discuss  this  subject  in  more  detail  in 
Section  4.2. 

In  Section  4.3,  we  discuss  the  three  methods  employed  by 
this  system  in  the  initial  table  construction  process, 
namely the construction of the LR(0) collection of item sets
and the SLR(1) and LALR(1) conflict resolution techniques
[DeRemer 1969, 1971, 1974], [Aho and Ullman 1972a, 1973a],
[Aho and Johnson 1974]. We end the section with a
discussion  of  the  class  of  grammars  for  which  the  system  can 
construct  parse  tables:  as  the  title  of  the  report  states, 
the  system  accepts  the  class  of  LALR(1)  grammars.  We  prove 
several lemmas about LL(1) grammars, in order to prepare for
a theorem in the next section. As an interesting use of the
lemmas, we prove the well-known result that the class of
LL(1) grammars is a subset of the class of LALR(1) grammars.
We also show that the class of LL(1) grammars is not a
subset of the class of SLR(1) grammars. Finally we show
that the class of LL(1) languages is a subset of the class
of SLR(1) languages.

Section  4.4  describes  the  modifications  necessary  to 
allow  the  introduction  of  semantic  action  symbols  associated 
with  points  in  a  production  other  than  the  production's 
completion.  We  discuss  the  ramifications  of  embedding 
semantic  actions,  and  techniques  allowing  an  efficient 
implementation  of  their  use,  such  as  left  association  and 
propagation.  We  also  discuss  a  method  of  determining  the 
points  in  each  production  where  distinct  semantic  action 
symbols  may  be  placed.  We  prove  that  our  system  can  process 
any  LL(1)  grammar  with  distinct  semantic  actions  at 
arbitrary  points  of  productions. 

Even  using  some  of  the  more  efficient  parse  table 
construction  algorithms,  the  tables  produced  are  fairly 
large.  Thus,  a  practical  system  requires  a  table 
optimization  phase.  Our  system  will  perform  some  of  the 
better known optimizations, such as LR(0) reduce state
elimination, the use of list representation, the use of
default actions, elimination of single productions, and
overlaying compatible tables in the list representation
[Anderson, Eve and Horning 1971], [Aho and Ullman 1972b,
1973b],  [Pager  1973a].  These  and  several  other 
optimizations  are  discussed  in  Section  4.5. 

In the final section in this chapter, Section 4.6, we
discuss  constructor  output.  The  first  part  of  the  section 
describes  output  directed  to  the  user  from  each  phase  of  the 
constructor.  The  remaining  parts  discuss  output  headed  for 
the  screener  and  parse  output  processor,  as  well  as  the 
parser. 


4.1:  Input  and  Transformation 

Conceptually,  we  may  describe  the  input  and 
transformation  phase  as  a  set  of  routines,  each  reading  a 
particular  grammar  format,  and  transforming  the  format  into 
a 'normalized' form. The user specifies a particular format
at the time of input. For purposes of exposition we will
assume our format is an extended version of Wortman's
regular expression grammar [Wortman 1973]. Other input
formats are discussed in Appendix B.

In  our  version  of  Wortman's  grammar,  productions  are 
separated  by  semicolons;  the  production  set  is  followed  by  a 
period.  Each  production  consists  of  three  parts: 
1) the nonterminal symbol being defined by the
production;

2) the meta-character '=', which is synonymous with
BNF's '::='; and

3)  a  (possibly  empty)  list  of  nonterminal  symbols, 
terminal  symbols,  semantic  action  symbols,  and 
comments. 

Nonterminal  symbols  and  action  symbols  are  sequences  of 
letters,  digits,  and  underline  characters,  much  like 
identifiers  in  most  programming  languages.  Terminal  symbols 
are  either  identifiers  specifically  declared  as  such,  or  are 
quoted strings. A comment is any string enclosed in the
PL/I comment brackets '/*' and '*/', the only proviso being
the obvious one: the string may not contain '*/'.

Several  productions  defining  the  same  nonterminal  symbol 
can  be  (and  for  readability’s  sake  should  be)  coalesced  into 
one  production  by  using  a  comma  as  the  delimiter  between 
alternatives. Repetition of sequences of symbols in the
language is indicated in the grammar using regular
expression notation: the sequence is enclosed in
parentheses immediately followed by either a plus sign
meta-character, indicating that the sequence appears one or more
times,  or  an  asterisk,  indicating  that  the  sequence  appears 
zero  or  more  times. 

Optionality  is  denoted  by  enclosing  the  optional 
sequence  in  parentheses  immediately  followed  by  a  question 
mark '?'. A sequence of alternatives may also be embedded
in  a  production  by  separating  the  alternatives  with  commas 
and  enclosing  the  entire  sequence  in  parentheses. 

As  an  example  of  the  above  definitions,  we  define 
Wortman’s  grammar  in  itself: 

grammar = rule ( ';' rule )* ;

rule = nonterminal_symbol '='
       alternative ( ',' alternative )* ;

alternative = ( nonterminal_symbol ,
                terminal_symbol ,
                '(' alternative ( ',' alternative )* ')'
                    ( '+' , '*' , '?' )?
              )* .

We  may  delineate  three  segments  in  the  overall  process 
of  input  and  transformation: 

1)  declaration  processing; 

2)  grammar  input;  and 

3)  transformation. 
4.1.1: Declaration Processing

Our system requires a declaration of all symbol sets.
Most  table  constructors  determine  which  symbols  in  the 
production  set  of  the  grammar  are  nonterminal  and  which  are 
terminal  by  assuming  that  any  symbols  referenced  on  the 
right  hand  side  of  some  production  but  never  appearing  on 
the left hand side of any production are terminal symbols;
all  the  rest  must  be  nonterminal  symbols.  Or  equivalently, 
all  the  symbols  appearing  on  the  left  hand  side  are  assumed 
to  be  nonterminal  symbols,  and  all  the  rest  are  assumed  to 
be  terminal  symbols.  In  either  case,  the  assumption 
Vn  +  Vt  =  V  is  made.  As  long  as  one  of  the  sets  can  be 
characterized by a predicate P, the other is simply ¬P,
within  the  universe  of  discourse  V.  This  assumption  is  not 
valid  in  this  constructor,  however;  a  set  of  semantic  action 
symbols  Va  also  exists.  Therefore,  we  require  a  declaration 
of  all  symbol  sets.  The  system  could  require  the 
declaration  of  just  the  action  symbol  set,  and  determine  the 
other  sets  as  discussed  above,  but  this  is  not  in  keeping 
with  good  software  design  principles. 

We  feel  that  this  redundancy  increases  the  reliability 
of  the  system:  there  is  less  chance  for  error  in  the 
determination  of  symbol  sets,  and  the  system  can  perform 
type  checking  on  references  to  symbols  in  the  productions  to 
help  prevent  their  inappropriate  use.  The  system  can  also 
inform  the  user  of  any  declared  symbols  which  are  never 
referenced,  and  has  the  information  necessary  to  detect 
misspelled  identifiers. 

It  should  be  pointed  out  that  the  user  need  only  declare 
those  terminal  symbols  not  surrounded  by  quotes  in  the 
grammar (in Wortman's notation). The grammar input step
assumes  that  quoted  strings  in  the  grammar  represent 
terminal  strings  in  the  language. 
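
As an illustration only, the checks made possible by full declarations might be sketched in Python as follows. The function name check_symbols, the dictionary keys, and the convention that a quoted terminal is a string beginning with an apostrophe are assumptions of this sketch, not details of the constructor.

```python
def check_symbols(declared, productions):
    """Report declaration problems for a grammar.

    declared    -- dict mapping 'nonterminal', 'terminal' and 'action'
                   to sets of declared names (hypothetical layout)
    productions -- list of (lhs, rhs) pairs, rhs being a list of symbols
    """
    known = set().union(*declared.values())
    referenced, undeclared, bad_lhs = set(), set(), set()
    for lhs, rhs in productions:
        # Only a declared nonterminal may be defined by a production.
        if lhs not in declared['nonterminal']:
            bad_lhs.add(lhs)
        for sym in [lhs, *rhs]:
            if sym.startswith("'"):
                continue  # quoted strings need no declaration
            (referenced if sym in known else undeclared).add(sym)
    return {
        'undeclared': undeclared,            # possibly misspelled identifiers
        'unreferenced': known - referenced,  # declared but never used
        'bad_lhs': bad_lhs,
    }


declared = {'nonterminal': {'S', 'A'}, 'terminal': {'id'}, 'action': {'emit'}}
productions = [('S', ['A', "';'", 'idd']), ('A', ['id'])]
report = check_symbols(declared, productions)
```

Here the misspelling idd is reported as undeclared and the action symbol emit as never referenced; neither could be detected if the symbol sets were inferred from the productions alone.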

The  declaration  processor  builds  symbol  dictionaries, 
which  are  used  in  the  remaining  parts  of  the  system.  For 
the  most  part,  we  avoid  the  details  of  these  dictionaries  in 
our  discussions.  The  curious  reader  is  invited  to  read 
Appendix  A,  which  contains  the  formats  of  the  entries  in 
each  dictionary  and  each  table  in  the  constructor. 


4.1.2:  Grammar  Input 

Each  grammar  format  requires  a  different  set  of  tables 
to  be  used  by  the  parser,  and  usually  a  different 
transformation  routine.  The  parser  controls  all  the  input 
performed  by  the  constructor,  by  parsing  according  to  the 
following grammar of the constructor's 'input language':
possible_input = ( batch )+ ;

batch = ( option_statement )*
        ( declaration )+
        grammar ;

declaration = 'declare' ( 'nonterminal' ,
                          'semantic' 'action' ,
                          'terminal' )
              '(' identifier ( ',' identifier )* ')' ';' ;

grammar = 'goal' 'symbol' 'is' identifier ';'
          'grammar' 'is' ( 'wre' ';' wre_grammar ,
                           'xvwn' ';' xvwn_grammar ,
                           'bnf' ';' bnf_grammar ) ;

Figure 4.1.2

A Simplified Grammar of Constructor Input


For  each  declaration  the  parser  calls  the  declaration 
processor  via  a  semantic  action.  We  have  omitted  the 
semantic  action  symbols  at  this  stage  for  simplicity;  the 
full  grammar  of  constructor  input  is  given  in  Appendix  D. 
After  the  declarations  have  been  processed,  the  parser  then 
reads  the  grammar  in  the  specified  format.  The  above 
grammar  defines  three  of  these  formats:  Wortman's  regular 
expression grammar (WRE), extended Van Wijngaarden notation
(XVWN; see [Van Wijngaarden 1969] and Appendix B), and BNF
[Naur 1963]. Users may of course define their own by
providing a grammar written in one of these three formats
and running it through the constructor. In fact, the system
was bootstrapped initially by hand-encoding the internal
representation of WRE. The system then produced tables
which  could  be  used  to  parse  grammars  written  in  WRE.  It 
was  then  a  simple  matter  to  incrementally  add  the  grammars 
of  the  other  input  formats. 

The  parsing  system  includes  a  paragraphing  lister,  as 
was  briefly  discussed  in  Section  3.5.  This  lister  can  be 
used  to  provide  a  listing  of  the  constructor  input  data  in  a 
paragraphed  form.  Other  listing  options  are  discussed 
below.

The  use  of  a  grammar  to  describe  the  input  to  the  parser 
allows  us  to  treat  the  constructor  input  and  output  in  a 
single  conceptual  framework,  as  well  as  to  show  off  the 
power  of  such  a  system  by  its  ease  of  extendability  to  other 
grammar presentation methods. Of course, a drawback to this
whole process is the fact that, if the table construction
phase is to be efficient, then we should expect all input
formats to be placed in a 'normal form' before the table
constructor  phase  processes  the  grammar.  Therefore,  each 
input  method  requires  its  own  transformation  routine. 
4.1.3: Transformation

4.1.3.1: The Need for a Normal Form

Our  'normal  form'  is  an  internalized  representation  of 
BNF.  The  requirement  of  some  normal  form  is  an  obvious  one, 
for  we  desire  a  high  degree  of  efficiency  in  the  table 
construction  process.  Saddling  the  construction  phase  with 
the  extra  requirement  of  accepting  a  variety  of  grammar 
formats  can  only  slow  the  process.  But  it  is  not  as  obvious 
just  what  the  normal  form  should  be.  It  certainly  should  be 
at  least  as  powerful  as  BNF:  we  must  be  able  to  handle 
recursion.  The  real  question  is,  need  the  normal  form  be  as 
powerful as, say, Wortman's regular expression grammar? Put
from a slightly different angle, this question becomes, can
we modify the table construction algorithms to process a
grammar  with  regular  expressions  on  the  right  hand  side  of 
productions,  and  if  so,  is  this  modification  worth  the 
trouble?  The  answer  to  the  first  part  of  this  question  is 
'yes'.  The  table  construction  process  can  be  modified  to 
deal  efficiently  with  regular  expressions  [Earley  1970], 
[DeRemer  1974],  [LaLonde  1975].  The  answer  to  the  second 
part  of  the  question,  however,  is  not  as  clear.  In  order  to 
ensure  a  solid  base  for  our  system,  we  'played  it  safe'  and 
accepted  BNF  as  our  normal  form.  Thus,  each  grammar 
presentation  method  requires  a  transformation  into  BNF, 
i.e., into a grammar whose productions contain no
meta-parentheses. We discuss in the following paragraphs the
transformation of WRE into BNF; Appendix B provides the
curious  reader  with  the  details  of  XVWN  transformation. 


4.1.3.2: Transformation Methods

All  the  transformation  methods  have  in  common  the 
property  that  they  reduce  intra-production  complexity  by 
substituting  a  simple  sequence  of  new  nonterminal  symbols 
for  some  complex  symbol  sequence.  These  new  nonterminal 
symbols  are  then  given  as  their  definition  a  series  of 
productions  equivalent  to  the  symbol  sequences  which  the 
nonterminals  replaced  in  the  original  production.  For 
example,  the  production  defining  the  nonterminal  symbol 
declaration  could  be  replaced  by  a  sequence  of  three 
productions,  as  shown  below. 
declaration = 'declare' A '(' identifier B ')' ';' ;

A = 'nonterminal' ,
    'semantic' 'action' ,
    'terminal' ;

B = ,
    B ',' identifier ;

Figure 4.1.3a

The Series of BNF Productions
Defining the Nonterminal Symbol declaration


These  productions  generate  the  same  set  of  terminal  strings, 
but  conform  to  a  simpler  form.  Left  recursion  has  replaced 
regular  expression  iteration. 

The WRE to BNF transformation process uses the fact that
each regular expression production can be written in the
form

N = h1 ( q1 )x1 h2 ( q2 )x2 ... hn ( qn )xn w ;

where N is the nonterminal being defined by the production,
and we have grouped the symbols on the right hand side of
the production based on the meta-parentheses in the
production, as follows:

each hi, for 1 ≤ i ≤ n, and w are simple sequences of
symbols not containing the meta-symbols left
parenthesis, right parenthesis, or comma;

each qi, for 1 ≤ i ≤ n, is an arbitrarily complex sequence
of alternatives; and

each xi, for 1 ≤ i ≤ n, is an element of the set
{'*', '+', e}; i.e., each xi is a regular
expression operator or is null.

As  an  example  of  the  above  form,  we  demonstrate  the 
parts  of  the  production  defining  the  nonterminal  symbol 
declaration: 
declaration = 'declare' ( 'nonterminal' ,
                          'semantic' 'action' ,
                          'terminal' )
              '(' identifier ( ',' identifier )* ')' ';'

N  = declaration
h1 = 'declare'
q1 = 'nonterminal' , 'semantic' 'action' , 'terminal'
x1 = null
h2 = '(' identifier
q2 = ',' identifier
x2 = '*'
w  = ')' ';'

and the production is N = h1 ( q1 )x1 h2 ( q2 )x2 w

Figure 4.1.3b
The Parts of a Production


The  parts  of  the  production  that  must  be  transformed  are 
confined to the qi, where for each qi the corresponding xi
determines  the  choice  of  transformations  to  be  applied.  We 
next  delve  into  the  question  of  which  transformations  should 
be  applied. 

The transformation process is composed of two subprocesses:

1) 'simplification' of the qi of a production; and

2)  replacement  of  regular  expression  iteration  with 
recursion. 

The  first  step  creates  new  productions  that  may 
themselves  be  complex  regular  expressions,  to  which  the 
transformation  process  must  be  applied  in  turn.  After  the 
complex  sequences  have  been  removed  by  the  first  step, 
however,  the  second  step  is  guaranteed  to  produce 
productions  needing  no  further  processing. 

Thus  the  first  step  in  the  transformation  process  which 
we  apply  to  each  production  is  the  simplification  of  the  qi, 
by  replacing  each  embedded  complex  sequence  by  a  new 
nonterminal  symbol.  A  new  production  is  then  added  to  the 
production  set.  This  new  production  defines  the  new 
nonterminal  symbol  to  be  the  complex  sequence  that  the 
nonterminal  replaced  in  the  original  production.  The 
complex  sequences  may  include  alternatives  and  regular 
expressions.  This  is  exemplified  by  the  definition  of  the 
nonterminal  symbol  alternative  in  Wortman's  regular 
expression  grammar.  That  production  is  also  a  good  example 
of  an  embedded  alternative  sequence  for  which  the 
surrounding  parentheses  can  not  yet  be  removed,  because  of 
the '*' regular expression operator following the last right
parenthesis  meta-symbol.  The  first  transformation  step 
applied  to  alternative  produces  Figure  4.1.3c. 
alternative = ( C )* ;

C = nonterminal ,
    terminal ,
    '(' alternative ( ',' alternative )* ')'
        ( '*' , '+' , '?' )? ;


Figure 4.1.3c

Applying Step 1 of the Transformation Process
to the Nonterminal Symbol alternative


To  perform  the  second  transformation  step,  we  substitute 
for  each  qi  and  xi  some  series  of  productions  defining  the 
same  terminal  strings  using  recursion  rather  than  iteration. 
In  the  general  case,  qi  could  have  originally  consisted  of 
an  embedded  alternative  sequence;  but  the  first 
transformation  step  just  applied  would  have  replaced  it  with 
a  single  nonterminal.  We  may  therefore  assume  that,  at  this 
point,  each  qi  in  this  production  is  in  the  same  simple  form 
demanded  originally  of  both  the  hi  and  w.  For  the  regular 
expression  (  qi  )+  we  can  substitute  a  new  nonterminal  B, 
and  define  B  by  the  two  productions 

B -> qi

B -> B qi

Similarly, for ( qi )* we can substitute B, with definitions

B -> e

B -> B qi

(We  remark  that  we  have  used  left  recursion  rather  than 
right  recursion  to  enable  the  constructor  to  process  a 
slightly  larger  class  of  grammars;  the  reader  is  invited  to 
read  Appendix  C  for  further  discussion.) 

Likewise,  for  (  qi  )?  we  can  substitute  B  with 
definitions 

B -> e

B -> qi

Applying  this  second  step  of  the  transformation  process 
to  declaration  would  result  in  the  production  sequence  shown 
in  Figure  4.1.3a.  The  case  of  alternative  is  more 
interesting,  in  that  it  illustrates  some  of  the 
inefficiencies  of  the  process.  Before  we  can  apply  step  2 
to  the  nonterminal  symbol  C  of  Figure  4.1.3c,  however,  we 
must  apply  step  1.  We  obtain: 
alternative = ( C )* ;

C = nonterminal ,
    terminal ,
    '(' alternative ( ',' alternative )* ')' ( D )? ;

D = '*' ,
    '+' ,
    '?' .


Figure 4.1.3d

Applying Step 1 Repeatedly

We then apply step 2 to obtain:

alternative = E ;

E = ,
    E C ;

C = nonterminal ,
    terminal ,
    '(' alternative F ')' G ;

F = ,
    F ',' alternative ;

G = ,
    D ;

D = '*' ,
    '+' ,
    '?' .


Figure 4.1.3e

The Result of Transformation
of the Nonterminal Symbol alternative


The  algorithm  implementing  the  two  steps  in  the 
transformation  process  takes  as  input  a  list  of  productions. 
Associated  with  each  production  is  a  flag  stating  whether  or 
not  the  production  should  be  tested  by  the  algorithm. 
Initially,  all  productions  are  flagged,  indicating  that  all 
productions  should  be  tested.  As  new  productions  are 
created  in  the  application  of  the  steps,  they  are  appended 
to  the  list;  those  added  by  step  1  are  flagged,  to  ensure 
that  they  will  also  be  tested.  The  algorithm  is  guaranteed 
to  terminate  because  there  are  a  finite  number  of  pairs  of 
parentheses  in  the  grammar. 
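
A minimal Python sketch of this worklist algorithm is given below, under assumptions of our own choosing: a parenthesized group is represented as the tuple ('group', alternatives, op) with op one of '', '+', '*', '?'; new nonterminals are named N1, N2, ...; and the worklist itself plays the role of the flags. None of these representations are taken from the constructor.

```python
from itertools import count


def is_group(item):
    return isinstance(item, tuple)


def normalize(productions):
    """Rewrite regular right parts into plain BNF productions.

    productions -- list of (lhs, alternatives); each alternative is a list
                   of symbol strings and groups ('group', alternatives, op)
    Returns a list of (lhs, rhs) pairs with flat right hand sides.
    """
    fresh = (f"N{i}" for i in count(1))   # hypothetical naming scheme
    worklist = list(productions)          # initially every production is flagged
    bnf = []
    while worklist:
        lhs, alternatives = worklist.pop(0)
        for alt in alternatives:
            rhs = []
            for item in alt:
                if not is_group(item):
                    rhs.append(item)
                    continue
                _, body, op = item
                simple = len(body) == 1 and not any(is_group(s) for s in body[0])
                if simple:
                    q = list(body[0])
                else:
                    # Step 1: name the complex sequence and flag its
                    # defining production for later processing.
                    c = next(fresh)
                    worklist.append((c, body))
                    q = [c]
                if op == '':
                    rhs.extend(q)         # parentheses simply disappear
                    continue
                # Step 2: iteration or optionality becomes recursion.
                b = next(fresh)
                rhs.append(b)
                if op == '+':
                    bnf += [(b, q), (b, [b] + q)]
                elif op == '*':
                    bnf += [(b, []), (b, [b] + q)]
                else:                     # op == '?'
                    bnf += [(b, []), (b, q)]
            bnf.append((lhs, rhs))
    return bnf


kinds = ('group', [["'nonterminal'"], ["'semantic'", "'action'"], ["'terminal'"]], '')
more = ('group', [["','", "identifier"]], '*')
declaration = ("declaration",
               [["'declare'", kinds, "'('", "identifier", more, "')'", "';'"]])
bnf = normalize([declaration])
```

Applied to the production for declaration, the sketch yields the productions of Figure 4.1.3a, with N1 and N2 in place of A and B. Each group is handled once, so the loop ends when the finite supply of parenthesis pairs is exhausted.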
This algorithm can be improved somewhat if w is null in
the canonical transformation form. We can then avoid
creation of nonterminal symbols such as E in Figure 4.1.3e.


4.2:  Grammatical  Analysis 

This  phase  checks  the  grammar  for  the  more  easily-found 
types  of  ambiguity  and  flags  useless  symbols.  Several 
relations  are  derived  among  grammar  symbols  that  are  useful 
in  the  construction  phase,  and  are  often  desired  by  the  user 
as  a  kind  of  semantic  analysis  of  the  grammar.  The  analysis 
consists  of  the  following  processes: 

1)  The  cross-reference  information  of  the  symbols  in  the 
grammar  is  derived.  The  information  derived  consists 
of  a  list,  for  each  symbol,  of  the  productions  in 
which  a  reference  to  the  symbol  appears.  The  lists 
are  in  alphabetical  order. 

2)  The  grammar  is  inspected  for  simultaneous  left  and 
right  recursion.  This  inspection  necessitates  the 
computation  of  the  sets  HEADS  and  TAILS. 

3)  The  set  of  nonterminal  symbols  that  can  produce  the 
null  string  is  determined.  This  set  is  used  in  the 
computation  of  the  HEADS  and  TAILS  sets,  as  discussed 
below.

4)  Any  useless  symbols  and  productions  in  the  grammar 
are  found  and  deleted. 

5)  The  FIRST  and  FOLLOW  sets  for  each  nonterminal  symbol 
are  determined. 


4.2.1:  Cross  Referencing 

In  this  first  substep  of  grammatical  analysis,  a  cross 
reference  map  of  all  symbols  is  produced,  listing  all 
productions  referencing  each  symbol.  This  information  is 
mainly  provided  for  the  user,  but  it  is  used  to  good 
advantage  in  several  algorithms  deriving  other  relations 
(such as the NULL set determination algorithm). Also, for
reasons  of  efficiency,  various  queues  and  Boolean  matrices 
are  created.  The  use  of  these  queues  and  matrices  is 
explained  in  greater  detail  below. 

The  grammar  can  optionally  be  listed  at  this  point  in  a 
'canonical' listing akin to a depth-first search. An
example  of  a  canonical  listing  is  contained  in  Appendix  E; 
the  example  is  the  grammar  of  the  SUE. 8  language,  with 
semantic  actions  embedded  in  the  grammar. 


4.2.2:  Simultaneous  Left  and  Right  Recursion 
A nonterminal symbol is simultaneously left and right
recursive  if  it  is  a  member  of  both  its  HEADS  and  its  TAILS 
sets.  A  grammar  containing  such  a  nonterminal  symbol  is 
ambiguous.  In  order  to  check  for  simultaneous  left  and 
right  recursion,  we  must  compute  the  sets  of  heads  and  tails 
of  each  nonterminal  symbol. 

The  computation  of  the  HEADS  sets  can  be  performed  quite 
efficiently  if  no  nonterminals  can  produce  the  null  string, 
by  computing  the  transitive  closure  of  the  initial  heads  of 
each  nonterminal  symbol  using  either  Warshall's  algorithm 
[Warshall  1962]  or  Purdom's  [Purdom  1970].  Using  a  Boolean 
matrix  HEADS  of  nonterminal  symbols  versus  nonterminal 
symbols, we initialize HEADS(A,B) = true if nonterminal B is
a production head of nonterminal A, and false otherwise. In
other words, HEADS(A,B) = true implies that there is a
production in P of the form

A -> B z

where A,B ∈ Vn, and z ∈ V*. What we wish to compute, for
each  nonterminal  symbol  A,  is  the  set  of  all  nonterminal 
symbols  that  may  possibly  begin  some  string  generated  by  A. 
If  we  have  the  productions 

A -> B z
B -> C y
C -> x

then HEADS(A) = {B,C}. For these productions the HEADS
matrix would initially be

      A  B  C
   A     •
   B        •
   C

Computing the transitive closure of the matrix, we have

      A  B  C
   A     •  •
   B        •      i.e., HEADS(A) = {B,C}
   C

The  computation  of  the  TAILS  sets  from  the  TAILS  matrix  is 
very  similar. 

The  existence  of  nonterminal  symbols  that  produce  the 
null  string  complicates  matters  somewhat.  Consider  the 
sequence  of  productions 

A -> B D x
B -> C
   | e
D -> y

Here, D ∈ HEADS(A) because B may generate the null string e.
However,  it  is  obvious  from  inspecting  the  productions  that 
the  transitive  closure  of  the  initial  head  set  of  A  (which 
is  {B})  will  not  include  D.  Therefore  we  must  modify  our 
matrix  initialization  process  to  inspect  the  initial  head 
set  of  each  nonterminal  symbol  as  the  set  is  being  created. 
If  a  head  symbol  can  produce  the  null  string,  the 
nonterminal  symbol  immediately  following  it  in  the 
production  (if  such  a  nonterminal  symbol  exists)  is  also 
included  in  the  initial  heads  matrix.  If  this  symbol  in  turn 
can  produce  the  null  string,  then  the  nonterminal  symbol 
following  it  is  included,  and  so  on.  When  this 
initialization  process  is  completed,  the  transitive  closure 
operation  may  be  applied. 

After  both  the  HEADS  and  TAILS  matrices  representing  the 
sets  have  been  computed,  the  scalar  product  of  their  main 
diagonals  informs  us  if  some  particular  nonterminal  symbol 
is  both  a  possible  head  and  a  possible  tail  of  itself, 
indicating  simultaneous  left  and  right  recursion,  and  hence 
ambiguity. 
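
The HEADS and TAILS computation and the diagonal test can be sketched in Python as follows. This is an illustrative sketch rather than the constructor's code: Boolean matrix rows are modelled as Python sets, the closure is Warshall's algorithm in set form, and the function names and the extra production C -> c are our own.

```python
def initial_relation(productions, nonterminals, nullable):
    """Initial HEADS relation, skipping over nullable head symbols."""
    rel = {a: set() for a in nonterminals}
    for lhs, rhs in productions:
        for sym in rhs:
            if sym not in nonterminals:
                break                 # a terminal ends the scan
            rel[lhs].add(sym)
            if sym not in nullable:
                break                 # later symbols cannot begin the string
    return rel


def transitive_closure(rel):
    """Warshall's algorithm with each matrix row held as a set."""
    closure = {a: set(bs) for a, bs in rel.items()}
    for k in closure:
        for i in closure:
            if k in closure[i]:
                closure[i] |= closure[k]
    return closure


def heads_and_tails(productions, nonterminals, nullable):
    heads = transitive_closure(
        initial_relation(productions, nonterminals, nullable))
    # TAILS is HEADS computed over the reversed right hand sides.
    mirrored = [(lhs, rhs[::-1]) for lhs, rhs in productions]
    tails = transitive_closure(
        initial_relation(mirrored, nonterminals, nullable))
    return heads, tails


def doubly_recursive(productions, nonterminals, nullable):
    """Nonterminals lying on the main diagonals of both matrices."""
    heads, tails = heads_and_tails(productions, nonterminals, nullable)
    return {a for a in nonterminals if a in heads[a] and a in tails[a]}


example = [('A', ['B', 'D', 'x']), ('B', ['C']), ('B', []),
           ('C', ['c']), ('D', ['y'])]
heads, _ = heads_and_tails(example, {'A', 'B', 'C', 'D'}, {'B'})

expr = [('E', ['E', '+', 'E']), ('E', ['n'])]
suspects = doubly_recursive(expr, {'E'}, set())
```

For the productions of the example, heads['A'] contains D as well as B and C, because the nullable B is skipped during initialization. The second grammar, E -> E + E, is flagged because E is both a head and a tail of itself.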


4.2.3: Computing the NULL Set

As  discussed  in  the  previous  paragraphs,  to  correctly 
compute  the  HEADS  and  TAILS  sets  we  must  have  at  our 
disposal  a  list  of  the  nonterminal  symbols  that  may  possibly 
generate  the  null  string.  Such  nonterminal  symbols  are 
termed  nullable.  This  computation  is  in  itself  nontrivial. 
Several solutions exist in the literature, however.
[Ginsburg 1966] and [Aho and Ullman 1972a] give essentially
identical  algorithms  for  determining  the  set  of  nonterminal 
symbols  that  generate  terminal  strings;  either  algorithm  may 
be  slightly  modified  to  determine  the  NULL  set  by  modifying 
the  method  of  handling  terminal  symbols,  as  [Knuth  1971] 
observes. 

The  basic  algorithm  (following  [Knuth  1971])  associates 
with  every  nonterminal  symbol  a  mark.  A  nonterminal  symbol 
is  marked  when  the  algorithm  determines  that  the  nonterminal 
symbol  can  produce  the  null  string.  Initially,  all 
nonterminal  symbols  directly  producing  the  null  string  are 
marked;  all  other  symbols  are  unmarked.  The  algorithm 
repeatedly  attempts  to  find  a  production  satisfying  two 
conditions: 

1)  the  nonterminal  symbol  defined  by  the  production 
is  unmarked;  and 

2)  all  the  symbols  on  the  right  hand  side  are 
marked. 

If  no  such  production  is  found,  the  process  halts;  if  one  is 
found,  then  the  nonterminal  symbol  defined  by  the  production 
is  marked,  and  the  process  is  repeated. 
As pointed out above, this same basic algorithm can be
used  to  determine  those  nonterminal  symbols  which  produce 
terminal  strings,  by  simply  regarding  terminal  symbols  as 
marked  rather  than  unmarked. 

The  basic  algorithm  given  above  does  not  tell  us  how  to 
find  a  production  satisfying  the  two  conditions  necessary  to 
mark  the  nonterminal  symbol  defined  by  the  production.  The 
simplest  method  is  the  obvious  linear  search:  start  at  the 
beginning  of  the  list  of  productions,  and  examine  each 
production  in  turn.  When  a  nonterminal  symbol  is  marked  by 
the  algorithm,  simply  return  to  the  beginning  of  the 
production  list,  and  start  anew.  The  algorithm  terminates 
when it finally 'falls off the end' of the production list.
The problem with this approach is its inefficiency: the
algorithm needlessly re-examines many productions whose
right hand sides contain no symbols that have been marked
in the meantime.

A  much  more  efficient  approach  is  to  use  the  nonterminal 
symbols  and  the  cross  referencing  information  to  direct  the 
search  of  the  production  list.  First,  we  limit  our 
examination  to  just  those  productions  that  reference  marked 
nonterminal  symbols.  For  each  newly  marked  nonterminal 
symbol,  we  examine  all  the  productions  that  reference  it  in 
their  right  hand  sides.  If  this  production  examination 
fails  to  mark  any  new  nonterminal  symbols,  we  simply  go  on 
to the next marked nonterminal symbol. If one or more
nonterminal symbols are marked by this examination, then the
productions referencing these newly marked nonterminal
symbols must also be examined. In either case the current
nonterminal symbol need never be looked at again. If
another  symbol  on  the  right  hand  side  of  the  same  production 
becomes  marked,  then  we  will  re-examine  this  production  when 
we  examine  all  the  productions  containing  references  to  the 
newly  marked  nonterminal  symbol.  In  fact,  by  employing  a 
queue  of  marked  nonterminal  symbols  remaining  to  be 
examined,  we  need  not  ever  examine  many  nonterminal  symbols. 
Any  nonterminals  directly  producing  the  null  string  are 
determined  while  producing  the  cross  reference  map:  the 
initial  queue  contains  these  nonterminal  symbols.  If  the 
algorithm  marks  a  nonterminal  symbol,  it  is  added  to  the 
queue.  When  the  queue  is  empty,  the  algorithm  terminates, 
and  all  nullable  nonterminal  symbols  will  have  been  marked. 
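
The queue-driven search can be sketched in Python as shown below. The sketch assumes that productions are (lhs, rhs) pairs containing only terminal and nonterminal symbols, and it builds its own cross reference map; the names nullable_set and xref are ours.

```python
from collections import defaultdict, deque


def nullable_set(productions):
    """Return the set of nonterminals that can produce the null string."""
    # Cross reference map: symbol -> indices of productions referencing it.
    xref = defaultdict(set)
    for index, (lhs, rhs) in enumerate(productions):
        for sym in rhs:
            xref[sym].add(index)

    # Nonterminals directly producing the null string start out marked.
    marked = {lhs for lhs, rhs in productions if not rhs}
    queue = deque(sorted(marked))

    while queue:
        sym = queue.popleft()
        # Only productions referencing the newly marked symbol can have
        # become eligible, so only those are examined.
        for index in sorted(xref[sym]):
            lhs, rhs = productions[index]
            if lhs not in marked and all(s in marked for s in rhs):
                marked.add(lhs)
                queue.append(lhs)
    return marked


grammar = [('S', ['A', 'B']), ('A', []), ('B', ['A', 'A']), ('C', ['c'])]
nullable = nullable_set(grammar)
```

In this small grammar A is nullable directly, B becomes nullable when A is taken from the queue, and S when B is; the production for C is never examined at all.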


4.2.4: Grammar Reduction

We  wish  to  ensure  the  grammar  contains  no  useless 
symbols  or  productions  —  ones  that  can  never  be  used  in 
recognizing  sentences  in  the  language  represented  by  the 
grammar.  The  constructor  normally  operates  under  the 
assumption  that  each  and  every  symbol  and  production  in  the 
grammar  plays  some  role  in  the  process  of  deriving  the  set 
of  terminal  strings  generated  by  the  grammar;  this 
assumption  is  verified  by  the  grammar  reduction  substep. 
We first characterize the properties of a useless
nonterminal symbol. A nonterminal symbol A is termed
unreachable if A does not appear in any sentential form:
¬(∃u,v ∈ V*) [S =>* uAv]. A is termed not grounded if A
does not generate any terminal strings: ¬(∃u ∈ Vt*)
[A =>* u]. Finally, A is termed useless if A is
unreachable, not grounded, or both.

Our grammar reduction algorithm is essentially an
implementation of that given in [Aho and Ullman 1972a] (or
[Ginsburg 1966]). The algorithm first computes from the
grammar  the  set  of  all  grounded  nonterminal  symbols.  Any 
nonterminal  symbols  not  grounded,  along  with  their  defining 
and  referencing  productions,  are  deleted  from  the 
dictionaries,  and  the  symbols  and  productions  so  deleted  are 
reported  to  the  user.  Some  terminal  and  semantic  action 
symbols  may  become  unreachable  by  this  operation,  by  virtue 
of  the  fact  that  they  are  referenced  only  by  ungrounded 
nonterminal  symbols  which  have  been  deleted.  These  are  also 
reported  to  the  user.  The  algorithm  next  computes  the  set 
of  reachable  nonterminal  symbols  from  the  modified 
dictionaries,  and,  as  before,  any  unreachable  symbols  and 
their  productions  are  deleted  from  the  dictionaries,  and  the 
user  notified  of  the  action  taken. 

Note  that  the  algorithm  must  compute  the  set  of  grounded 
nonterminals  first,  and  not  vice  versa;  there  may  be  a 
grounded  symbol  X  which  is  reachable  only  through  an 
ungrounded but reachable nonterminal symbol A. As the
following  example  demonstrates,  if  we  compute  the  reachable 
nonterminal  set  first,  we  would  find  that  A  and  X  are  both 
members;  then  computing  the  grounded  set,  A  would  be  deleted 
as  ungrounded,  along  with  its  defining  productions,  thus 
making  X  unreachable.  This  fact  would  go  unnoticed  by  the 
algorithm  and,  more  importantly,  unreported  to  the  user. 

G = <Vn = {S,A,X},
     Vt = {y,z},
     S ∈ Vn,
     P = {(S,y), (S,A), (A,AX), (X,z)}>,

G' = GROUND(G) = <Vn = {S,X},
                  Vt = {y,z},
                  S ∈ Vn,
                  P = {(S,y), (X,z)}>, and

G" = REACH(G') = <{S}, {y}, S, {(S,y)}>,

but unfortunately,

G1 = REACH(G) = G, and

G2 = GROUND(G1) = GROUND(G) = G' ≠ G".

The  algorithm  computing  GROUND(G)  is  almost  identical  to 
that  computing  NULL(G).  All  the  GROUND  algorithm  need  do 
differently  is  regard  terminal  symbols  as  marked.  As  in  the 
NULL set algorithm, we set up an initial queue of all
initially  marked  nonterminals  —  those  which  directly 
produce  a  terminal  string.  When  a  nonterminal  symbol  is 
marked  by  the  algorithm,  it  is  appended  to  the  queue;  when 
the  queue  is  empty,  the  algorithm  terminates.  Upon 
termination,  we  will  have  for  each  nonterminal  symbol  an 
indication  of  whether  or  not  that  nonterminal  symbol  can 
possibly  produce  any  terminal  string. 

The  REACH  algorithm  could  again  be  similar  to  that  of 
GROUND.  The  initial  queue  would  consist  of  simply  the  goal 
symbol.  Symbols  would  be  marked  when  they  are  referenced  in 
the  defining  productions  of  marked  nonterminal  symbols. 

When  the  algorithm  terminated,  any  nonterminal  symbol  not 
marked  would  be  unreachable. 
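
The two queue-based computations, and the importance of their order, can be illustrated with the following Python sketch. It is a simplified rendering under our own assumptions: productions are (lhs, rhs) pairs, the function names are hypothetical, and the productions are scanned linearly where the constructor would consult its cross reference information.

```python
from collections import deque


def grounded(productions, terminals):
    """Nonterminals that can produce some terminal string."""
    marked = {lhs for lhs, rhs in productions
              if all(s in terminals for s in rhs)}
    queue = deque(sorted(marked))
    while queue:
        sym = queue.popleft()
        for lhs, rhs in productions:
            if (sym in rhs and lhs not in marked
                    and all(s in terminals or s in marked for s in rhs)):
                marked.add(lhs)
                queue.append(lhs)
    return marked


def reachable(productions, goal):
    """Symbols appearing in some sentential form derived from the goal."""
    marked = {goal}
    queue = deque([goal])
    while queue:
        sym = queue.popleft()
        for lhs, rhs in productions:
            if lhs == sym:
                for s in rhs:
                    if s not in marked:
                        marked.add(s)
                        queue.append(s)
    return marked


def reduce_grammar(productions, terminals, goal, ground_first=True):
    def keep_grounded(prods):
        ok = grounded(prods, terminals)
        return [(l, r) for l, r in prods
                if l in ok and all(s in terminals or s in ok for s in r)]

    def keep_reachable(prods):
        ok = reachable(prods, goal)
        return [(l, r) for l, r in prods if l in ok]

    steps = [keep_grounded, keep_reachable]
    if not ground_first:
        steps.reverse()
    for step in steps:
        productions = step(productions)
    return productions


P = [('S', ['y']), ('S', ['A']), ('A', ['A', 'X']), ('X', ['z'])]
T = {'y', 'z'}
right_order = reduce_grammar(P, T, 'S')
wrong_order = reduce_grammar(P, T, 'S', ground_first=False)
```

On the example grammar G, the correct order leaves only S -> y, while the reversed order leaves the unreachable production X -> z in place.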

In  fact,  there  is  another  method  that  is  quite  similar 
to  the  one  used  in  the  computation  of  the  HEADS  and  TAILS 
sets  that  we  discussed  in  Section  4.2.2.  Suppose  we  have  a 
Boolean  matrix  REACHABLE  of  nonterminal  symbols  versus 
nonterminal symbols, with REACHABLE(A,B) = true if
nonterminal B appears in some A-sentential form, and false
otherwise: REACHABLE(A,B) implies A =>* uBv for A,B ∈ Vn
and u,v ∈ V*. Then the set we are interested in is
{X ∈ Vn | REACHABLE(S,X) = true}.

The  desired  Boolean  matrix  is  quite  easily  computed  as 
the transitive closure of an initial matrix INITIAL_REACH.
INITIAL_REACH(A,B) = true if there is a reference to
nonterminal symbol B in a production defining nonterminal
symbol A; that is, if there is in P a production A -> uBv
where A,B ∈ Vn, and u,v ∈ V*. This reference information is
exactly  what  we  computed  in  the  cross  reference  computation, 
minus  any  symbols  deleted  in  the  meantime  as  a  result  of  not 
being  grounded. 


4.3:  Practical,  Efficient  Algorithms  for  Parse  Table 
Construction 

In  this  section  we  discuss  two  topics  which  help  make 
the  construction  process  more  practical.  The  first  involves 
computing  the  minimum  collection  of  item  sets.  The  second 
consists  of  computing  the  closure  of  each  nonterminal  symbol 
during  the  grammatical  analysis  phase,  and  thus  simplifying 
the  item  set  closure  algorithm. 

Before  turning  our  attention  to  practical  algorithms  for 
parse  table  construction,  we  consider  the  reasons  why  the 
canonical  method  generates  such  a  large  number  of  item  sets. 
The lookahead string u of an LR(k) item [A -> x•y, u] gives
one  of  the  terminal  strings  that  we  can  expect  to  see  after 
the  completion  of  the  production  A  ->  xy.  For  every  use  of 
the  nonterminal  symbol  A  in  the  right  hand  side  of  some 
production, we will have a sequence of item sets
[A -> •xy, u1], ..., [A -> x•y, u1], ..., [A -> xy•, u1], for
each legal lookahead string u1. The more references to A
there are in the grammar, the more item sets there will be.
Since  each  reference  has  a  different  right  context,  these 
sequences  of  items  differ  from  each  other  only  in  the 
lookahead  strings.  If  k  =  1,  then  these  lookahead  strings 
play  no  role  in  the  determination  of  the  parsing  action 
until the final item [A -> xy•, u] has been reached. Recall
that  the  table  generation  algorithm  for  an  item  set 
constructs  the  parsing  action  function  f  from  the  formula: 

if [A -> x•y, u] ∈ C, and y ≠ e, then f(v) = shift for
all v ∈ e_free_FIRST_k(yu).

If k = 1 and y ≠ e, then e_free_FIRST_k(yu) =
e_free_FIRST_k(y), and u plays absolutely no part in the
function  determination.  Yet  its  presence  in  the  items 
inhibits  the  compaction  of  the  many  different  items 
representing  references  to  the  nonterminal  symbol  A.  This 
results  in  a  large  number  of  item  sets,  and  thus,  tables. 


4.3.1:  The  Incremental  Approach 

4.3.1.1: Constructing LR(0) Item Sets

We  use  the  incremental  approach  to  table  construction 
[DeRemer 1969, 1971], which is the most frequently used
approach  in  practical  constructors.  Rather  than  compute  the 
canonical  LR(1)  tables  directly,  as  in  Knuth's  original 
formulation of the process, DeRemer's method takes advantage
of the fact that most parse tables require no lookahead at
all, i.e., much of the parsing can be LR(0). DeRemer's
algorithm generates the collection of LR(0) item sets
(termed the LR(0) collection), and then augments with
lookahead just those item sets requiring it. DeRemer
defines two methods of augmenting item sets, termed the SLR
and LALR approaches. We now present a more formal treatment
of  the  incremental  approach. 

An  LR  (0)  item  is  an  ordered  pair  (p,j),  where  p  and  j 
are  as  defined  for  an  LR  (k)  item  —  i.e., 

p  is  the  index  of  some  production  in  P;  and 

j  is  either  the  index  of  some  symbol  on  the  right  hand 
side  of  the  pth  production,  or  is  zero. 

Again  as  in  the  case  of  LR  (k)  items,  the  presence  of  an 
item  (p,j)  in  an  item  set  indicates  that  when  the  parser  is 
using  the  table  associated  with  this  item  set,  it  is 
possible  that 

1)  production  p  has  been  used  to  generate  a  portion  of 
the  input  seen  so  far;  and 

2)  we  have  already  recognized  the  first  j  symbols  of  the 
production . 

All  we  have  done  in  going  from  the  general  LR  (k)  items 
to  LR  (0)  items  is  completely  ignore  lookahead  strings.  But 
since they help cause the large number of item sets in the
LR(k) item set collection, we may be saving a good deal of
work. 

We are now in a position to motivate the meaning of k in
LR(k), and to explain just how anything could be parsed
using an LR(0) parser, i.e., one which seemingly never looks
at the input string. The k in LR(k) refers to the length of
the  string  of  input  symbols  necessary  to  uniquely  determine 
the  parsing  action  to  be  performed  at  each  parse  step. 

Thus,  each  string  in  the  domain  of  the  parsing  action 
function  f  must  be  of  length  k  or  less.  If  a  grammar  is 
LR(0), this reflects the fact that we can uniquely determine
which parsing action to perform without consulting the input
string for help. However, our goto function g still does
consult the symbol just 'read'. Thus for LR(0) grammars g
determines whether or not the symbol is legal.

We  present,  as  an  example,  the  item  sets  and  parse 
tables  for  the  arithmetic  expression  grammar  G: 

G = <Vn = {E,T,F},
     Vt = {a,+,*,(,)},
     E ∈ Vn,
     P>, where P is the set of productions

(1) E -> E+T
(2) E -> T
(3) T -> T*F
(4) T -> F
(5) F -> (E)
(6) F -> a


Figure  4.3.1a 

An  Arithmetic  Expression  Grammar  G 


We  first  derive  the  augmented  grammar  G'  by  adding  the 
new  goal  symbol  S  and  a  new  production  S  ->  E.  We  then 
generate the LR(0) item sets, as shown in Figure 4.3.1b.
The parse tables constructed from the item sets are shown in
Figure 4.3.1c. We have denoted the parsing action shift by
S, reduce i by Ri, for 1 ≤ i ≤ 6, accept by A, and error by x.
 0:  S -> •E
     E -> •E+T
     E -> •T
     T -> •T*F
     T -> •F
     F -> •(E)
     F -> •a

 1:  S -> E•
     E -> E•+T

 2:  E -> T•
     T -> T•*F

 3:  T -> F•

 4:  F -> (•E)
     E -> •E+T
     E -> •T
     T -> •T*F
     T -> •F
     F -> •(E)
     F -> •a

 5:  F -> a•

 6:  E -> E+•T
     T -> •T*F
     T -> •F
     F -> •(E)
     F -> •a

 7:  T -> T*•F
     F -> •(E)
     F -> •a

 8:  F -> (E•)
     E -> E•+T

 9:  E -> E+T•
     T -> T•*F

10:  T -> T*F•

11:  F -> (E)•

Figure 4.3.1b

The LR(0) Items for the Augmented Grammar G'
Derived from the Arithmetic Expression Grammar G

Figure 4.3.1c

The (Inadequate) LR(0) Tables
for the Arithmetic Expression Grammar

Notice that, for the most part, LR(0) is good enough for
the arithmetic expression grammar; usually, there is a
unique action in the table. However, tables 1, 2 and 9 of
Figure 4.3.1c are exceptions; LR(0) is inadequate, in that
it cannot resolve the parsing action conflict. To be able
to uniquely determine whether to shift or reduce, we
must enable f to examine the next input symbol.
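
The construction of the LR(0) collection for G' can be sketched as below. This is an illustrative sketch, not the constructor's actual code: items are the pairs (p,j) defined earlier, and the names GRAMMAR, closure, goto, lr0_collection, and is_inadequate are hypothetical. The sketch numbers the item sets in its own order of discovery, which need not agree with Figure 4.3.1b. An item set is counted as inadequate when it holds a final item together with any other item.

```python
GRAMMAR = [
    ("S", ("E",)),              # production 0: the added goal production
    ("E", ("E", "+", "T")),
    ("E", ("T",)),
    ("T", ("T", "*", "F")),
    ("T", ("F",)),
    ("F", ("(", "E", ")")),
    ("F", ("a",)),
]
NONTERMINALS = {lhs for lhs, _ in GRAMMAR}

def closure(items):
    """Add (q,0) for every production q of a nonterminal after a dot."""
    items = set(items)
    work = list(items)
    while work:
        p, j = work.pop()
        rhs = GRAMMAR[p][1]
        if j < len(rhs) and rhs[j] in NONTERMINALS:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[j] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return frozenset(items)

def goto(item_set, symbol):
    """Closed successor item set reached by reading one symbol."""
    moved = {(p, j + 1) for p, j in item_set
             if j < len(GRAMMAR[p][1]) and GRAMMAR[p][1][j] == symbol}
    return closure(moved)

def lr0_collection():
    states = [closure({(0, 0)})]
    edges = {}
    i = 0
    while i < len(states):      # states grows as successors are found
        symbols = {GRAMMAR[p][1][j] for p, j in states[i]
                   if j < len(GRAMMAR[p][1])}
        for x in sorted(symbols):
            target = goto(states[i], x)
            if target not in states:
                states.append(target)
            edges[(i, x)] = states.index(target)
        i += 1
    return states, edges

def is_inadequate(item_set):
    has_final = any(j == len(GRAMMAR[p][1]) for p, j in item_set)
    return has_final and len(item_set) > 1

states, edges = lr0_collection()
inadequate = [s for s in states if is_inadequate(s)]
```

For this grammar the sketch finds twelve item sets, three of which are inadequate, in agreement with the item sets of Figure 4.3.1b and the three exceptional tables noted above.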


4.3.1.2: Extending LR(0)

In attempting to 'correct' the parsing action function f
for  the  cases  where  lookahead  is  needed  to  uniquely 
determine  the  parsing  action,  we  redefine  the  domain  of  f  in 
terms  of  lookahead  strings  of  length  1,  i.e.,  as  if  we  were 
performing  the  LR(1)  computation.  We  associate  a  lookahead 
set  with  each  final  item  in  each  item  set.  This  set 
contains  all  the  symbols  that  may  follow  the  string  derived 
from  the  production  just  completed. 

We  will  discuss  below  two  methods  of  determining  this 
lookahead  set.  The  first  is  called  SLR,  Simple  LR,  because 
of  its  relatively  simple  rule:  we  use  the  FOLLOW  set  of  the 
nonterminal  symbol  defined  by  the  production  involved  in  the 
final  item  as  the  item's  lookahead  set.  The  second  method 
is called LALR, Lookahead LR, and computes the exact right
context  of  the  final  item  via  a  recursive  procedure.  This 
lookahead  set  is  in  general  a  subset  of  the  FOLLOW  set  of 
the  nonterminal  defined  by  the  production  in  the  final  item. 

Once  we  have  computed  the  lookahead  sets  for  final 
items,  we  can  construct  the  parsing  tables  from  the  item 
sets.  This  time,  however,  we  have  k  =  1. 

We  should  note  that  the  resultant  tables  will  look  quite 
different  from  the  tables  produced  when  k  =  0  because  the 
error  detection  responsibility  has  shifted.  For  k  >  0  the 
f-function  has  the  responsibility  for  error  detection,  since 
it  examines  the  input  stream  to  decide  upon  a  parsing 
action.  When  k  =  0,  however,  the  f-function  does  not 
consult  the  input  stream,  and  the  g-function  has  the 
responsibility  for  error  detection. 

Assuming that f is now defined on Vt union {e}, the
strings in Vt* of length less than or equal to one, we can
summarize the problem of LR(0) conflicts by noting that
LR(0) reduce actions are assumed to be applicable for all
lookahead strings. For example, in Table 2 of Figure
4.3.1c,  the  reduce  2  action  is  assumed  valid  for  all 
terminal  symbols,  and  simultaneously  a  shift  action  is  to  be 
performed  when  the  input  symbol  is  '*'.  Thus,  we  have  a 
shift/reduce  conflict.  If  the  input  symbol  is  '*',  should 
the  parser  shift  or  reduce?  The  various  methods  of  conflict 
resolution  we  are  about  to  discuss  restrict  the  set  of 
symbols  for  which  the  reduce  action  is  valid.  We  first 
discuss  the  SLR  method  (Simple  LR) . 

4.3.1.2.1: The SLR(1) Conflict Resolution Technique

Suppose the parser is using table 2 of Figure 4.3.1c,
which  is  associated  with  item  set  2  of  Figure  4.3.1b.  The 
parser  must,  by  looking  at  the  next  input  symbol,  choose 
between  reducing  T  to  E  or  shifting.  If  the  input  symbol  is 
not  a  symbol  which  can  legally  follow  E  in  some  sentential 
form,  then  it  does  not  make  sense  to  reduce:  if  we  do  so,  we 
will  not  be  able  to  shift  past  the  next  input  symbol, 
because  it  can  not  follow  E.  This  is  the  essence  of  the  SLR 
approach.  Instead  of  assuming  that  a  reduce  action  is  to  be 
performed  on  the  set  of  all  terminal  symbols,  we  trim  this 
set  to  just  the  FOLLOW  set  of  the  nonterminal  symbol 
involved  in  the  reduction.  We  note  that  for  most 
programming  language  grammars  this  is  sufficient  [Anderson, 
Eve  and  Horning  1973],  [Horning  1974a]. 

Figure 4.3.1d shows the parse tables for our example
grammar G, with SLR conflict resolution applied to each
final item. We should remark that the SLR method is often
applied to just the LR(0)-inadequate item sets, rather than
to all of the item sets.

Figure 4.3.1d
The  Parsing  Tables 

for  the  Arithmetic  Expression  Grammar  G 
with  SLR  Conflict  Resolution 
Applied  to  All  Reduce  Actions 


Thus  the  SLR  conflict  resolution  technique  is  an 
improvement  over  the  LR(0)  assumption  on  the  applicability 
of  reduce  actions,  but  the  SLR  assumption  is  still  a 
simplification.  It  may  not  always  resolve  the  conflict. 

The algorithm performing the SLR conflict resolution attempt
uses the FOLLOW set of each nonterminal symbol, as computed
by the grammatical analysis phase. For each inadequate item
set the algorithm compares the FOLLOW set of each
nonterminal symbol involved in a reduction with the set of
transition symbols in the item set (which includes the
FOLLOW sets of all other nonterminal symbols involved in a
reduce action in this item set). If all such sets are
disjoint, the conflict has been resolved.
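
The disjointness test can be sketched as follows. The function name slr_resolves and the end-of-input marker END (standing in for the lookahead e) are hypothetical, and the FOLLOW set of E for the grammar G is written out by hand rather than taken from the grammatical analysis phase.

```python
from itertools import combinations

END = "$"   # assumed marker for end of input (the lookahead e)

def slr_resolves(shift_symbols, reduce_follow_sets):
    """True if the shift symbols and every FOLLOW set are pairwise disjoint."""
    groups = [set(shift_symbols)] + [set(f) for f in reduce_follow_sets]
    return all(a.isdisjoint(b) for a, b in combinations(groups, 2))

# FOLLOW(E) for the arithmetic expression grammar G
FOLLOW_E = {"+", ")", END}

# Item set 2 = { E -> T• , T -> T•*F }: shift on '*', reduce on FOLLOW(E)
resolved_2 = slr_resolves({"*"}, [FOLLOW_E])

# A hypothetical item set that shifts on d and reduces on {b, d}
resolved_bad = slr_resolves({"d"}, [{"b", "d"}])
```

The first call succeeds because '*' is not in FOLLOW(E); the second fails because d appears on both sides.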


We should also point out that by using the full set of
possible right contexts of the nonterminal symbol rather
than the (restricted) right context of the nonterminal
symbol in a particular item set, the SLR(1) approach makes
the assumption that the set of input symbols signalling the
use of a nonterminal symbol A in one context is disjoint
from the set of input symbols signalling the non-use of A in
another [Anderson, Eve and Horning 1973]. Consider the
following grammar and its LR(0) item sets:


G  =  <Vn  =  {S,A},  Vt  =  {a,b,c,d},  S,  P>,  where  P  is 

S -> aAb
   | bAd
   | acd


A  ->  c 


 0:  S' -> •S
     S  -> •aAb
     S  -> •acd
     S  -> •bAd

 1:  S' -> S•

 2:  S  -> a•Ab
     S  -> a•cd
     A  -> •c

 3:  S  -> b•Ad
     A  -> •c

 4:  S  -> aA•b

 5:  S  -> ac•d
     A  -> c•

 6:  S  -> bA•d

 7:  A  -> c•

 8:  S  -> aAb•

 9:  S  -> acd•

10:  S  -> bAd•

Figure 4.3.1e

A LALR(1) grammar which is not SLR(1)


Here,  d  signals  the  use  of  nonterminal  symbol  A  in  bAd, 
and  the  non-use  of  A  in  acd.  The  discrepancy  is  noted  in 
item  set  5.  We  may  resolve  the  SLR(1)  conflict  by  using 
left context to determine that in item set 5 d cannot
follow A; thus the exact right context of A in the item set
is {b}, rather than {b,d}. Thus this grammar is not SLR(1);
however,  it  is  LALR(1),  as  we  shall  see  in  the  next  section. 

4.3.1.2.2: The LALR(1) Conflict Resolution Technique

The LALR(1) method determines those symbols which may
follow the nonterminal symbol involved in a reduction in the
context of a particular item set. This set of symbols is a
subset of the FOLLOW set of the nonterminal. For example,
in Figure 4.3.1e the LALR method would determine that only b
could follow A after the reduction in item set 5. Thus, if
the next input symbol the parser examines is b, an
appropriate parsing action would be reduce; if the next
input symbol is not b, the parser will not perform a reduce
action.  Recapitulating,  we  may  say  that  the  LALR(1)  method 
determines  the  exact  right  context  of  a  nonterminal  symbol 
involved  in  a  reduction  in  a  particular  item  set. 

The  technique  employed  by  the  LALR  algorithm  uses  a  goto 
graph  derived  from  either  the  LR(0)  collection  of  item  sets, 
or from the g-functions of the LR(0) tables of the grammar.
This goto graph is really just a finite state machine of the
productions of the grammar. DeRemer uses a very similar
concept, the characteristic finite state machine of the
grammar [DeRemer 1971].

The  graph  is  composed  of  named  nodes,  which  correspond 
to  item  sets  or  tables,  and  labelled  directed  edges,  where 
the  label  is  a  grammar  symbol.  There  is  an  edge  labelled 
with X from node i to node j whenever gi(X) = j. This means
that item set i contains an item [A -> y•Xz], and item set j
contains an item [A -> yX•z], where A ∈ Vn, y,z ∈ V*, and
X ∈ V.

As  an  example  of  a  goto  graph,  we  portray  the  graph  of 
the grammar of Figure 4.3.1e:

      S
  +------>(1)
  |   a        A        b
 (0)----->(2)----->(4)----->(8)
  |        |   c        d
  |        +------>(5)----->(9)
  |   b        A        d
  +------>(3)----->(6)----->(10)
           |   c
           +------>(7)


Figure 4.3.1f

The Goto Graph for the Grammar of Figure 4.3.1e

We must next extend the definition of our transition
function T from single symbols to strings of symbols, as
follows:

T(C,e) = C

T(C,Xz) = T(T(C,X),z)

We  can  also  modify  T  to  return  the  closed  item  set, 
rather  than  just  the  basis  set;  then  T  is  equivalent  to  the 
GOTO function of Aho and Ullman. Essentially, T (or GOTO)
walks  through  the  goto  graph,  starting  at  the  node 
associated  with  item  set  C,  and  following  a  path  spelled  out 
by  the  symbols  of  the  string  Xz. 

We  can  now  define  the  predecessor  function  with  two 
parameters  (a  closed  item  set  and  a  string)  as 

T⁻¹(C,z) = {C' | C = T(C',z)}.

Then the exact right context function R is as follows:

R(C,[A -> x•y]) = {a ∈ (Vt union {e}) | C' ∈ T⁻¹(C,x),
                                        [B -> w•Az] ∈ C',
                                        a ∈ FIRST(zv), and
                                        v ∈ R(C',[B -> w•Az])}

This  definition  provides  us  with  an  algorithm  if,  whenever  a 
computation of R(C,[A -> x•y]) invokes itself, the latter
invocation simply returns {e}. This precaution must be
taken  because  a  goto  graph  may  contain  cycles. 
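
The recursive computation of R can be sketched for the grammar of Figure 4.3.1e as below. The sketch makes several assumptions that are not part of the definition above, and all function names are hypothetical: the grammar has no empty productions, so FIRST(zv) reduces to FIRST(z) when z is nonempty and to v when z = e; the right context of the goal item is taken to be an end marker END, standing for the lookahead e; and a re-entrant invocation contributes nothing, rather than returning {e}, which is a common variant of the cycle guard.

```python
GRAMMAR = [
    ("S'", ("S",)),             # production 0: the augmented goal
    ("S", ("a", "A", "b")),
    ("S", ("b", "A", "d")),
    ("S", ("a", "c", "d")),
    ("A", ("c",)),
]
NT = {lhs for lhs, _ in GRAMMAR}
END = "$"                       # assumed end marker (the lookahead e)

def closure(items):
    items, work = set(items), list(items)
    while work:
        p, j = work.pop()
        rhs = GRAMMAR[p][1]
        if j < len(rhs) and rhs[j] in NT:
            for q, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[j] and (q, 0) not in items:
                    items.add((q, 0))
                    work.append((q, 0))
    return frozenset(items)

def build_goto_graph():
    states, edges, i = [closure({(0, 0)})], {}, 0
    while i < len(states):
        symbols = {GRAMMAR[p][1][j] for p, j in states[i]
                   if j < len(GRAMMAR[p][1])}
        for x in sorted(symbols):
            target = closure({(p, j + 1) for p, j in states[i]
                              if j < len(GRAMMAR[p][1])
                              and GRAMMAR[p][1][j] == x})
            if target not in states:
                states.append(target)
            edges[(i, x)] = states.index(target)
        i += 1
    return states, edges

def first_sets():
    """FIRST sets, assuming no production has an empty right hand side."""
    first, changed = {a: set() for a in NT}, True
    while changed:
        changed = False
        for lhs, rhs in GRAMMAR:
            new = first[rhs[0]] if rhs[0] in NT else {rhs[0]}
            if not new <= first[lhs]:
                first[lhs] |= new
                changed = True
    return first

def predecessors(c, string, edges):
    """T⁻¹(C,x): nodes from which the path spelled by x ends at C."""
    current = {c}
    for x in reversed(string):
        current = {i for (i, sym), j in edges.items()
                   if sym == x and j in current}
    return current

def right_context(c, item, states, edges, first, active=frozenset()):
    p, j = item
    if p == 0:
        return {END}            # assumption: only end of input follows the goal
    if (c, item) in active:
        return set()            # re-entrant call: contribute nothing new
    active = active | {(c, item)}
    a, rhs = GRAMMAR[p]
    result = set()
    for c1 in predecessors(c, rhs[:j], edges):
        for q, k in states[c1]:
            b_rhs = GRAMMAR[q][1]
            if k < len(b_rhs) and b_rhs[k] == a:    # item [B -> w•Az]
                z = b_rhs[k + 1:]
                if z:           # FIRST(zv) = FIRST(z), as z cannot vanish
                    result |= first[z[0]] if z[0] in NT else {z[0]}
                else:           # z = e, so FIRST(zv) = v
                    result |= right_context(c1, (q, k), states, edges,
                                            first, active)
    return result

states, edges = build_goto_graph()
first = first_sets()
set_5 = states.index(frozenset({(3, 2), (4, 1)}))   # {S -> ac•d, A -> c•}
set_7 = states.index(frozenset({(4, 1)}))           # {A -> c•}
context_5 = right_context(set_5, (4, 1), states, edges, first)
context_7 = right_context(set_7, (4, 1), states, edges, first)
```

For the final item [A -> c•] the sketch yields {b} in the item set that also contains [S -> ac•d], and {d} in the item set where it stands alone, so the shift on d no longer conflicts with the reduction.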

The  above  method  is  very  similar  to  that  suggested  by 
Anderson  and  used  by  LaLonde  [LaLonde  1971].  Adams  [Adams 
and  White  1975]  gives  a  definition  of  the  exact  right 
context function for LALR(k), where k > 0. Our definition
differs slightly, in that we have used T⁻¹ rather than the
PRED function used by Anderson, LaLonde, and Adams. This
latter function can be defined to be

PRED(C,len) = {C' | C = T(C',x), and |x| = len}.

In general, T⁻¹ is a subset of PRED.

The  parser,  incidentally,  is  not  dependent  on  the  fact 
that we happen to construct LALR(1) tables. The same
parsing algorithm will work for LR(0), SLR(1), LALR(1), and
LR(1) tables. Thus the differences between SLR, LALR, and
LR(1) are not noticed in the parsing algorithm, but rather
in the amount of work the constructor is forced to perform.
The  grammar  class  also  has  some  effect  on  the  size  of  the 
parse  tables  [Purdom  1974]. 


4.3.2:  Improving  the  Completion  Operation 

An  interesting  facet  of  the  closure  computation  is  the 
fact  that  it  is  often  repeated  many  times  for  the  same 
nonterminal  symbol  —  once  for  each  reference  to  the  symbol. 
When  k  =  0,  each  computation  generates  exactly  the  same  set 

augmenting inadequate tables with lookahead, Anderson
directly calculates the SLR(1) tables. Anderson combines
Aho and Ullman's f and g functions for terminal symbols by
noting that for a ∈ Vt, g(a) ≠ error if and only if
f(a) = shift. Aho and Ullman [Aho and Ullman 1973b] provide
a very readable formulation of SLR(1) table construction.

We mentioned the incremental approach to LALR(1);
Anderson, Pager [Pager 1972], and Aho and Ullman [Aho and
Ullman 1973a,b] all discuss the decremental approach to
LALR, which basically consists of generating the LR(1)
tables and performing some method of partition compaction.
The interested reader is referred to the above references,
especially [Aho and Ullman 1973b]. Brosgol's ALR(1) method
(Articulated LR: [Brosgol 1975]) accepts a class of grammars
between LALR(1) and LR(1), which seems to be an advantage
while debugging a grammar, since it is a simple matter to
make an LALR(1) grammar non-LALR(1) by the addition of a few
productions. However, his techniques produce larger tables
than do LALR(1) techniques. LARK, the constructor system
described  in  [Adams  and  White  1975],  does  not  stop  at 
LALR(1)  as  ours  does;  LARK  splits  item  sets  of  the  LALR(1) 
collection  in  an  attempt  to  resolve  LALR(1)  conflicts. 

And  finally,  we  note  that  the  incremental  approach  to 
LALR  tables  we  are  using  is  a  special  case  of  Anderson's 
LA(k)LR(m)  grammars:  those  grammars  deterministically 
parsable  in  a  bottom-up  manner,  using  LR(m)  parse  tables, 
augmented  with  k-symbol  lookahead  at  some  points,  where 
k  >  m.  In  other  words,  LALR(1)  is  equivalent  to  LA(1)LR(0). 
We  refer  the  reader  to  [Anderson  1972]  and  [LaLonde  1976] 
for  a  considerably  more  detailed  discussion. 


4.3.5:  The  Class  of  LL(1)  Grammars  is  a  Subset  of  the  Class 
of LALR(1) Grammars

In  this  section  we  demonstrate  that  the  class  of  LL(1) 
grammars  is  contained  in  the  class  of  LALR(1)  grammars. 

This  theorem  is  not  new,  but  the  proof  is  noteworthy  in  that 
it  argues  in  terms  of  the  item  set  collection  of  the  LL(1) 
grammar. 

A grammar is LL(1) if it satisfies the following three
conditions [Knuth 1971]:

For each set of productions A -> w1 | w2 | ... | wn,
with n > 0,

(1) FIRST(wi) intersect FIRST(wj) = ∅, for 1 ≤ i ≤ n,
1 ≤ j ≤ n, i ≠ j.

(2) At most one of the wi's produces the null string e.

(3) If wi =>* e, then FOLLOW(A) intersect FIRST(wj) = ∅,
for 1 ≤ j ≤ n, j ≠ i.
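
The three conditions can be checked mechanically, as in the following sketch. It is an illustration only: the grammar representation (a dictionary from each nonterminal to a list of right hand side tuples, with () for the empty production), the markers EPS and END, and the function names are assumptions. Condition (1) is tested on the terminal symbols of the FIRST sets, with the derivation of e handled separately by condition (2).

```python
from itertools import combinations

EPS, END = "", "$"      # assumed markers: the null string e, end of input

def first_of_string(symbols, first):
    result = set()
    for x in symbols:
        if x not in first:              # x is a terminal symbol
            result.add(x)
            return result
        result |= first[x] - {EPS}
        if EPS not in first[x]:
            return result
    result.add(EPS)                     # every symbol can vanish
    return result

def first_sets(grammar):
    first, changed = {a: set() for a in grammar}, True
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                new = first_of_string(rhs, first)
                if not new <= first[a]:
                    first[a] |= new
                    changed = True
    return first

def follow_sets(grammar, goal, first):
    follow, changed = {a: set() for a in grammar}, True
    follow[goal].add(END)
    while changed:
        changed = False
        for a, alternatives in grammar.items():
            for rhs in alternatives:
                for i, x in enumerate(rhs):
                    if x in grammar:
                        tail = first_of_string(rhs[i + 1:], first)
                        new = tail - {EPS}
                        if EPS in tail:
                            new |= follow[a]
                        if not new <= follow[x]:
                            follow[x] |= new
                            changed = True
    return follow

def is_ll1(grammar, goal):
    first = first_sets(grammar)
    follow = follow_sets(grammar, goal, first)
    for a, alternatives in grammar.items():
        firsts = [first_of_string(rhs, first) for rhs in alternatives]
        terminals = [f - {EPS} for f in firsts]
        nullable = [EPS in f for f in firsts]
        if any(t1 & t2 for t1, t2 in combinations(terminals, 2)):
            return False                # condition (1) violated
        if sum(nullable) > 1:
            return False                # condition (2) violated
        for i, vanishes in enumerate(nullable):
            if vanishes and any(follow[a] & t
                                for j, t in enumerate(terminals) if j != i):
                return False            # condition (3) violated
    return True

# A hypothetical LL(1) expression grammar, and the left-recursive grammar G
LL1_GRAMMAR = {
    "E": [("T", "E'")],
    "E'": [("+", "T", "E'"), ()],
    "T": [("a",), ("(", "E", ")")],
}
G = {
    "E": [("E", "+", "T"), ("T",)],
    "T": [("T", "*", "F"), ("F",)],
    "F": [("(", "E", ")"), ("a",)],
}
ll1_ok = is_ll1(LL1_GRAMMAR, "E")
g_ok = is_ll1(G, "E")
```

The arithmetic expression grammar G fails condition (1), since both alternatives of E begin with the same symbols; the hypothetical grammar with E' satisfies all three conditions.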

Lemma 4.3.5.1 shows that the item sets in the LR(0)
collection constructed from an LL(1) grammar are of a
certain restricted form: namely, that each basis set has
only one item.


Lemma 4.3.5.1:

Each nonempty item set in the LR(0) collection
constructed  from  an  LL(1)  grammar  has  a  basis  set  containing 
exactly  one  element. 

Proof:

Informally,  we  will  conclude  that  the  occurrence  of  more 
than  one  element  in  a  basis  set  of  an  item  set  implies  that 
at  the  point  in  the  parse  represented  by  the  item  set,  we 
have  partially  recognized  more  than  one  production.  The 
LL(1)  condition  prohibits  this;  we  must  be  able  to  determine 
which  production  applies  as  soon  as  we  apply  it.  The  proof 
is  by  induction. 

(1) basis of induction:

The basis set of the initial item set of the LR(0)
machine is {[S' -> •S]}, by definition of the constructor,
and certainly contains exactly one element.

(2) induction step:

Suppose that an item set has a basis set containing
exactly one element [A -> x•y], with A ∈ Vn and x,y ∈ V*.

We wish to prove that the basis sets of all nonempty
successor sets of the item set have exactly one element. We
may delineate three cases:

(i) y = e; the closure operation adds no new elements
to the item set. The successor set of the single item
[A -> x•] is empty, by definition.

(ii) The transition symbol of [A -> x•y] is a terminal:
y = az, where a ∈ Vt and z ∈ V*. The basis set of the
single successor item set is {[A -> xa•z]}, by
definition of the LR(0) constructor algorithm. This basis
set has only one item.

(iii) The transition symbol of [A -> x•y] is a
nonterminal: y = Bz, where B ∈ Vn, and z ∈ V*. Before we
can compute the successor sets of the item set, we must
first compute the item set's closure. We first add the
items [B -> •v], for each defining production (B,v) ∈ P.

Each  of  the  nonempty  productions  must  begin  with  a  different 
symbol,  since  by  LL(1)  condition  (1)  the  FIRST  sets  of  all 
alternatives  of  a  nonterminal  must  be  disjoint.  Also,  any 
further  items  added  in  the  closure  computation  will  have 
different  transition  symbols  from  any  previously  added 
items,  again  because  LL(1)  condition  (1)  ensures  the 
disjointness of the FIRST sets. Therefore, the transition
sets  for  each  of  these  transition  symbols  will  all  generate 
separate  basis  sets,  each  with  exactly  one  element.  All 
items  added  due  to  empty  productions  have  empty  transition 
sets. 

Thus,  each  nonempty  item  set  in  the  LR(0)  collection 
constructed  from  an  LL(1)  grammar  has  a  basis  set  containing 
exactly  one  element. 

Q.E.D.


Lemma 4.3.5.2 shows that LR(0) conflicts in an LL(1)
grammar  can  only  be  caused  by  empty  productions. 


Lemma 4.3.5.2:

Any LR(0) conflicts in the LR(0) collection constructed
from an LL(1) grammar are caused by the existence of empty
items, that is, items of the form [A -> •].

Proof:

An LR(0) conflict is caused when an item set contains
both  final  items  and  intermediate  items.  In  an  LL(1) 
grammar,  any  final  item  formed  during  the  LR(0)  machine 
construction  from  a  production  with  a  nonempty  right  hand 
side  is  in  an  item  set  by  itself,  because  the  basis  set  of 
each  nonempty  item  set  has  exactly  one  element  (by  the  above 
Lemma)  and  the  closure  operation  will  not  add  any  others. 
Thus,  there  can  be  no  conflict.  The  only  other  type  of 
final  item  is  the  type  introduced  during  closure  due  to  some 
empty  production.  If  there  are  any  conflicts,  they  must  be 
due  to  this  culprit. 

Q.E.D.


Corollary 4.3.5.3:

Any LL(1) grammar with no empty productions is an LR(0)
grammar.

Proof:

If  empty  productions  are  the  unique  cause  of  LR(0) 
conflicts,  then  an  LL(1)  grammar  with  no  empty  productions 
cannot have LR(0) conflicts, and is therefore an LR(0)
grammar. 

Q.E.D.


In  proving  that  the  class  of  LL(1)  grammars  is  a  subset 
of  the  class  of  LALR(1)  grammars,  we  will  implicitly  use  the 
following definition of LALR(1): a grammar is LALR(1) if
there are no inadequate item sets in the LR(0) collection of
item sets when this collection is augmented with LALR(1)
lookahead. This means that in at least each LR(0)-inadequate
item set I, we use the exact right context
function of the empty item producing the LR(0) conflict to
resolve the conflict. In particular, the empty item
[A -> •] is replaced with the items [A -> •, R(I,[A -> •])].

In the following Theorem, we use the symbol '+' to denote
the set union operation, and the symbol '∩' to denote the
set intersection operation.


Theorem 4.3.5.4:

The class of LL(1) grammars is a subset of the class of
LALR(1) grammars.

Proof:

Lemma 4.3.5.1 demonstrated that all transition symbols
are distinct in each item set. Lemma 4.3.5.2 showed that
any LR(0) conflicts are caused by empty items. We note that
all  that  remains  to  be  shown  is  that  in  each  LR(0)- 
inadequate  item  set  I  of  the  LR(0)  collection  generated  from 
an  LL(1)  grammar,  the  exact  right  context  of  each  empty  item 
must  be  disjoint  from  any  transition  symbols  or  right 
contexts  of  other  empty  items  in  I. 

The  proof  is  divided  into  two  main  sections.  The  first 
part  demonstrates  that  the  right  context  of  an  arbitrary 
empty  item  must  be  disjoint  from  the  transition  symbols  of 
all  other  items.  The  proof  of  this  first  section  is 
essentially  an  induction  on  the  number  of  closure  operations 
needed  before  the  empty  item  was  added  to  the  item  set.  The 
second  section  then  demonstrates  that  the  right  contexts  of 
any  two  empty  items  in  a  particular  item  set  must  be 
disjoint. 

Let  I  be  an  inadequate  item  set.  Then  I  contains  an 
empty item of the form [A -> •]. We know that this item was
added  during  the  closure  of  I,  along  with  items  formed  from 
all  other  defining  productions  of  A.  Let  the  defining 
productions  of  A  be 

A -> e | a2 | ... | an for n ≥ 1, with ai ∈ V* for 1 < i ≤ n.
Then I contains the items [A -> •a2], ..., [A -> •an], and
also [B -> x•Ay], where this last item is the one which
caused  the  items  dealing  with  A  to  be  added  to  I  during  the 
closure  computation.  We  first  show  that  the  empty  item  is 
not  in  conflict  with  the  items  formed  from  other  defining 
productions  of  A. 

Since A =>* e, LL(1) condition (3) assures us that
FOLLOW(A) ∩ FIRST(ai) = ∅ for 1 < i ≤ n. Because
R(I,[A -> •]) is a subset of FOLLOW(A), we also have
R(I,[A -> •]) ∩ FIRST(ai) = ∅, for 1 < i ≤ n. Thus there
are no conflicts with the items [A -> •ai], for 1 < i ≤ n.

We next 'move up' a level in the item set, and examine
the item that caused the empty item to be added to I. Let
us suppose that [B -> x•Ay] is not in the basis set of I.
Then x = e, and the production from which the item was
formed is B -> Ay. This item was also added during closure,
along with other items formed from defining productions of
B, due to an item in I of the form [C -> w•Bz], with C ∈ Vn,
and w,z ∈ V*. Let the defining productions of B be

B -> Ay | b2 | ... | bm for m ≥ 1, with bi ∈ V* for 1 < i ≤ m.
We know by LL(1) condition (1) that

FIRST(Ay) ∩ FIRST(bi) = ∅, for 1 < i ≤ m        (1)

If y cannot derive e, then R(I,[A -> •]) is a subset of
FIRST(Ay), and we have

R(I,[A -> •]) ∩ FIRST(bi) = ∅, for 1 < i ≤ m.

If, however, y =>* e, then since B -> Ay is a production, we
have B =>* e; LL(1) condition (3) tells us in this case that
FOLLOW(B) ∩ FIRST(bi) = ∅ for 1 < i ≤ m. We can combine
this fact with point (1) above, to conclude

(FIRST(Ay) + FOLLOW(B)) ∩ FIRST(bi) = ∅, for 1 < i ≤ m.

We know that R(I,[A -> •]) is a subset of
FIRST(Ay) + FOLLOW(B); therefore we can conclude
R(I,[A -> •]) ∩ FIRST(bi) = ∅, for 1 < i ≤ m.

Thus  the  empty  item  is  not  in  conflict  with  any  of  the 
items  formed  from  the  defining  productions  of  B. 

We  can  take  this  process  one  more  step,  and  examine  the 
item [C -> w•Bz]. If [C -> •Bz] is not in the basis set of
I, then we can repeat the above argument, and demonstrate
that the empty item is not in conflict with items formed
from defining productions of C:

R(I,[A -> •]) ∩ FIRST(ci) = ∅, for 1 < i ≤ q
where the defining productions of C are

C -> Bz | c2 | ... | cq for q ≥ 1, with ci ∈ V*.

At some point, however, we will finally reach a point
where the item [C -> w•Bz] is in the basis set of the item
set I (if w = e, then C = S' and Bz = S). Lemma 4.3.5.1
tells us that [C -> w•Bz] is the only member of the basis
set, and we already know that R(I,[A -> •]) ∩ FIRST(bi) = ∅,
for 1 < i ≤ m, where m is the number of defining productions
of B.


Thus  the  right  context  of  the  empty  item  is  disjoint 
from  the  FIRST  sets  of  all  alternatives  of  B  except  the  one 
leading  to  the  empty  item's  inclusion  in  I. 

The problem where two empty items may conflict is
handled by using LL(1) condition (2); at some point, we will
have nonterminal B whose defining productions

B -> b1 | b2 | ... | bm

are such that the item [B -> •b1] leads to the inclusion of
the empty item [A1 -> •], and the item [B -> •b2] leads to
the inclusion of the empty item [A2 -> •]. By the arguments
used above, we can conclude the following:

R(I,[A1 -> •]) is a subset of FIRST(b1)                     (2)

or, if b1 =>* e, then

R(I,[A1 -> •]) is a subset of FIRST(b1) + FOLLOW(B)         (3)

Similarly,

R(I,[A2 -> •]) is a subset of FIRST(b2)                     (4)

or, if b2 =>* e, then

R(I,[A2 -> •]) is a subset of FIRST(b2) + FOLLOW(B)         (5)

Invoking LL(1) condition (1), we have

FIRST(b1) ∩ FIRST(b2) = ∅                                   (6)

By LL(1) condition (2), only one of b1, b2 =>* e.

If b1 =>* e, then FOLLOW(B) ∩ FIRST(b2) = ∅, and we use
points (3), (4), and (6) to conclude

R(I,[A1 -> •]) ∩ R(I,[A2 -> •]) = ∅                         (7)

The case b2 =>* e is handled similarly. If neither b1
nor b2 derive the empty string e, then we can use points
(2), (4), and (6) to conclude point (7).

We  have  thus  shown  that  each  empty  item  has  an  exact 
right  context  which  is  disjoint  from  the  set  of  transition 
symbols  of  the  item  set,  as  well  as  from  the  right  contexts 
of  other  empty  items  in  the  same  item  set. 

Thus LALR(1) techniques resolve the inadequacy of I.
Therefore the grammar is LALR(1), and we have: the class of
LL(1) grammars is a subset of the class of LALR(1) grammars.


Q.  E.  D. 


What we would like to do next is prove that the SLR(1)
resolution technique is sufficient to resolve any LR(0)
conflicts; then all LL(1) grammars would be SLR(1) grammars.
In fact, [Aho and Ullman 1973b] claim this to be true; they
leave it as an exercise to chapter 7. But the FOLLOW sets
used in the SLR(1) resolution technique may contain symbols
quite irrelevant to the current item set, that nevertheless
cause the SLR(1) technique to be inadequate. The problem is
demonstrated by the following example:

G = <Vn = {S, A, B},
     Vt = {x, y, z},
     S ∈ Vn,
     P>, where P is the set of productions

S -> A B x
A -> B y
   | x
B -> z
   | e

We  first  demonstrate  that  G  is  LL(1),  by  showing  it  fulfills 
the  LL(1)  conditions: 

i) with respect to nonterminal A:

   FIRST(By) = {z,y} is disjoint from FIRST(x);

ii) and with respect to B:

   1) FIRST(z) is disjoint from FIRST(e); and
   2) FOLLOW(B) = {x,y} is disjoint from FIRST(z).

But G is not SLR(1), because input symbol x signals two
distinct actions in item set 0, as demonstrated below:


0:  [S -> •A B x],      on A, goto 1
    [A -> •B y],        on B, goto 2
    [A -> •x],          on x, goto 3
    [B -> •z],          on z, goto 4
    [B -> •],           on {x,y}, reduce by production 4


The  problem  arises  because  B  can  be  followed  by  an  x. 

The  SLR(1)  technique  ignores  left  context,  whereas  the  LL(1) 
technique  uses  it:  only  the  local  follow  set  is  relevant 
here.  This  is  exactly  what  the  LALR(1)  technique  computes. 
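
The conflict can be checked mechanically. The following Python sketch computes FIRST and FOLLOW for G by the usual fixed-point iteration and intersects FOLLOW(B) with the terminal transition symbols of item set 0. All names in it (GRAMMAR, first_of, and so on) are illustrative and are not part of the system described here; the local lookahead {y} for item set 0 is entered by hand from the discussion above rather than computed.

```python
EPS = ""  # stands for the empty string, written e in the text

# Grammar G from the example; the dict keys are the nonterminals.
GRAMMAR = {
    "S": [["A", "B", "x"]],
    "A": [["B", "y"], ["x"]],
    "B": [["z"], []],  # [] is the empty alternative B -> e
}


def first_of(seq, first):
    """FIRST of a symbol string, given FIRST sets of the nonterminals."""
    out = set()
    for sym in seq:
        sym_first = first[sym] if sym in GRAMMAR else {sym}
        out |= sym_first - {EPS}
        if EPS not in sym_first:
            return out
    out.add(EPS)  # every symbol of seq can derive the empty string
    return out


def compute_first():
    first = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                new = first_of(alt, first)
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first


def compute_follow(first):
    follow = {nt: set() for nt in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for nt, alts in GRAMMAR.items():
            for alt in alts:
                for i, sym in enumerate(alt):
                    if sym not in GRAMMAR:
                        continue
                    tail = first_of(alt[i + 1:], first)
                    new = tail - {EPS}
                    if EPS in tail:
                        new |= follow[nt]
                    if not new <= follow[sym]:
                        follow[sym] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)

# Terminal transition symbols of item set 0 (shift actions).
shift_symbols_0 = {"x", "z"}
# SLR(1) reduces by B -> e on every symbol of FOLLOW(B).
slr_conflict = shift_symbols_0 & FOLLOW["B"]
# Assumption taken from the text: in item set 0, B is introduced only
# by [A -> .B y], so its local follow set is {y}.
local_lookahead_0 = {"y"}
lalr_conflict = shift_symbols_0 & local_lookahead_0
```

Under these assumptions slr_conflict contains x, while lalr_conflict is empty, which is the distinction drawn in the paragraph above.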


Corollary 4.3.5.5:

Every LL(1) language is an LR(1) language.

Proof:

Immediate from the theorem above and the fact that every
LALR(1) language is an LR(1) language.

Q. E. D.


4.4:  Semantic  Actions 

One  may  logically  take  the  view  that  semantic  actions 
are  associated  with  points  in  a  production,  rather  than  with 
symbols  in  the  production.  For  reasons  of  efficiency, 
however,  we  associate  semantic  actions  in  each  production 
with  the  symbol  immediately  to  the  right  of  the  actions 
whenever  possible. 

In  this  section  we  will  first  describe  the  methods  of 
association.  This  will  lead  us  into  a  brief  discussion  of 
propagation  [Gordon  1975].  After  that,  we  will  discuss  just 
where  in  a  production  our  methods  allow  us  to  place  distinct 
semantic  actions.  We  will  prove  that  our  system  can 
correctly  process  any  LL(1)  grammar  with  distinct  semantic 
actions  associated  with  arbitrary  points  in  productions.  We 
finish the section by summarizing our rules for placing
semantic action symbols in productions and by providing two
examples.

4.4.1: Association of Semantic Action Symbols with Other
Grammar Symbols

Consider  the  alternatives  in  associating  semantic 
actions  with  other  symbols  in  a  production: 

(1)  no  association; 

(2)  right  association; 

(3)  left  association;  or 

(4)  some  combination  of  left  and  right  association. 

Conceptually, semantic actions to the right of a grammar
symbol (whether nonterminal or terminal) are performed after
recognition of that symbol and before recognition of the
next symbol to the right of the semantic actions. For
example, in the production Z -> X I1 I2 Y, where X,Y ∈ V and
I1,I2 ∈ Va, the actions I1 and I2 are performed, in that
order, between the recognition of X and that of Y. Thus, it
seems at first glance that there need not be any association
of semantic actions with other grammar symbols. They stand
on their own.

Unfortunately,  actions  not  associated  with  a  particular 
symbol  in  a  particular  production  often  cause  extra  tables 
to  be  generated  in  the  construction  process  and  extra 
parsing  steps  to  be  performed  by  the  parser.  In  addition, 
lookahead  is  often  needed  to  determine  whether  or  not  an 
action should be performed. Let us study the effects of
association  of  semantic  actions  with  other  grammar  symbols. 
If  semantic  action  sequences  are  associated  with  a 
particular  symbol  in  a  particular  production,  then  when  this 
symbol  is  recognized  in  the  context  of  this  production,  the 
actions  are  performed.  The  parser  does  not  act  upon  each 
action  symbol  as  a  separate  symbol  requiring  a  table  and  a 
parse  step.  As  a  result,  both  time  and  space  are  saved. 

First, suppose that all actions are right-associated.
By this we mean that all actions are to the right of the
symbol with which they are associated. This fits the usual
interpretation of association; once the associated symbol
has  been  recognized,  the  actions  will  be  performed.  The 
action  symbols  are  demoted  relative  to  the  other  types  of 
grammar  symbols.  If  a  graph  of  a  production  right-hand-side 
were  to  be  drawn,  the  action  symbols  would  become  part  of 
other  grammar-symbol  nodes,  rather  than  nodes  themselves. 

There are several problems with right-association,
however. First, it does not cover all the cases. When
action symbols come first in the production right-hand-sides,
there is no symbol to which the actions may be
right-associated.

Secondly, there are cases when a semantic action in the
middle of a production simply cannot be right-associated.
Consider the following form of the Boolean selection
construct:

if <Bool_exp>
    then <stmt_list>
    else <stmt_list>
end

We wish to exdent the 'end' to the indentation level of the
'if'. The following suffices:

if_stmt = 'if' boolean_expression #indent #new_line 'then'
          #indent ( statement #new_line )+ #exdent
          ( 'else' #indent ( statement #new_line )+ #exdent )?
          #exdent 'end' ;

where, following WEE notation, action symbols begin with
'#', terminals are quoted strings, and nonterminals are
identifiers. There is no way we can know to perform the
final #exdent without recognizing the symbol after the
#exdent, namely the 'end'. Thus, this final #exdent cannot
be right-associated.

Thirdly, needless conflicts arise. Consider the
following two productions:

A1 -> X Y Z I1 I2 W
A2 -> X Y Z I3 V

If actions I1 and I2 are right-associated to symbol Z in the
production defining A1, and I3 is right-associated to Z in
the production defining A2, then when the parser recognizes
the Z of each production in some item set with basis set
{[A1 -> X Y•Z I1 I2 W], [A2 -> X Y•Z I3 V]}, a semantic
action conflict arises due to the different action
sequences; we must introduce lookahead to resolve it.

Now let us consider left-association. When the
associated grammar symbol is a terminal, definition of
left-association is easy: semantic action sequences are to the
left of the terminal symbol. When the terminal symbol is
recognized, the action sequence is performed before stacking
(and printing) the terminal symbol. We defer for the moment
a discussion of the technicalities of left-association to
nonterminal symbols.

Left-association cures some of the ills of
right-association. If we suppose that W and V are different
terminal symbols in the second example given above,
left-association rids us of that possible lookahead. If the
#exdent of the above selectional construct definition is
left-associated with the 'end', no problem (and no extra
lookahead) exists. Conflicts are still possible, however,
as demonstrated by the basis set {[A1 -> X Y Z•I1 I2 W V],
[A2 -> X Y Z•I3 W U]}. But it is clear that any conflicts
in left-associated actions will still be conflicts if the
actions were right-associated, while left-association often
has no conflicts where right-association does.

But there are still problems. Action sequences
appearing at the end of a production cannot be
left-associated with symbols in the production. And we have a
'Catch-22' situation when left-associating actions with
nonterminal symbols: we must be sure the nonterminal
applies before performing the actions, yet we must perform
the actions before recognizing the nonterminal since the act
of recognition itself may involve some further action
sequences. We are forced to look ahead in an attempt to
determine the applicability of the actions at this point.

We are also limited to one symbol lookahead on the input
stream, thus limiting our action conflict resolution powers
to those cases resolvable by examination of the FIRST sets
of each nonterminal. For example, we can uniquely determine
which semantic actions to perform for the basis set
{[A1 -> X Y Z•I1 I2 W], [A2 -> X Y Z•I3 V]} only if FIRST(W)
and FIRST(V) are disjoint (assuming neither W nor V is
nullable).


4.4.2:  Propagation 

Rather than have the parser look ahead in the input
stream and determine which symbol we are about to begin
recognizing, the constructor can propagate the actions
left-associated with the nonterminal symbol into the defining
productions for the symbol, to the left of the first symbol
in each production [Gordon 1975]. This propagation
continues until all actions have been propagated to the left
of terminal symbols, which we know how to handle: the
recognition and execution process is quite clear.

The problems with this method are twofold. First, we
get many copies of the actions. This is easily mended by
using pointers to action sequences. The other problem is
not so much a problem as an unfortunate consequence: by
propagating action sequences to the left of terminal symbols
already having a left-associated action sequence, we may
have discovered another type of action conflict, an
inconsistent action sequence. For example, the items
[A1 -> •I1 A2], [A2 -> •A3], [A3 -> •I2] result in
propagation of I1 to the left of I2 in the third production;
if I1 = 'put scope block on symbol stack', and I2 = 'remove
scope block from symbol stack', we have a series of actions
that just waste time. We must assume that any action
sequence written as one sequence by the user is consistent;
but any two sequences that end up being merged into one, but
not written that way by the user, may be inconsistent.
Therefore, whenever this case occurs the user is notified so
the sequence can be hand-checked for consistency.
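
A minimal Python sketch of propagation follows, under several assumptions that are ours and not taken from the constructor itself: a production is modelled as a list of [left_actions, symbol] entries plus a list of final actions, action sequences are copied rather than shared through pointers, directly left-recursive productions are skipped, and the grammar has no indirect left recursion. The names Production and propagate are hypothetical. The function returns the merged sequences that a user would be asked to check by hand.

```python
from dataclasses import dataclass, field


@dataclass
class Production:
    lhs: str
    rhs: list  # entries are [left_actions, symbol]
    final_actions: list = field(default_factory=list)


def propagate(productions, terminals):
    """Push actions left-associated with nonterminals into their
    defining productions; report sequences merged along the way."""
    merged = []
    # Without left recursion the propagation depth is bounded by the
    # number of productions, so this many passes always suffice.
    for _ in range(len(productions) + 1):
        moved = False
        for prod in productions:
            for entry in prod.rhs:
                actions, symbol = entry
                if not actions or symbol in terminals:
                    continue
                entry[0] = []  # the actions leave the nonterminal
                moved = True
                for target in productions:
                    if target.lhs != symbol:
                        continue
                    if target.rhs and target.rhs[0][1] == target.lhs:
                        continue  # directly left-recursive: skipped
                    if target.rhs:
                        existing = target.rhs[0][0]
                        target.rhs[0][0] = actions + existing
                    else:
                        existing = target.final_actions
                        target.final_actions = actions + existing
                    if existing:
                        merged.append((target.lhs, actions, existing))
        if not moved:
            return merged
    raise ValueError("propagation did not settle (left recursion?)")


# The example from the text: A1 -> I1 A2, A2 -> A3, A3 -> I2.
example = [
    Production("A1", [[["I1"], "A2"]]),
    Production("A2", [[[], "A3"]]),
    Production("A3", [], ["I2"]),
]
to_check = propagate(example, terminals=set())
```

Here to_check holds one entry, reporting that I1 was merged in front of I2 in the production for A3, which is the potentially inconsistent sequence described above.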


4.4.3:  Where  Can  We  Place  Semantic  Actions? 

Lewis and Stearns [Lewis and Stearns 1968] demonstrate a
method of determining just where distinct semantic action
symbols may be placed in a production, by generating what
they term a derived symbol Polish grammar from the original
grammar, and then determining via a complex algorithm the
distinction index of particular pairs of symbols.

Each production of the original grammar is transformed
into derived symbol Polish form by substituting for each
grammar symbol X a new symbol W(X,i), where the i is an
integer reflecting which instance of X has been supplanted.
This new symbol W(X,i) is treated as a nonterminal symbol in
the derived grammar, and given a defining production
W(X,i) -> X. Its translation element is X w'(X,i), if
X ∈ Vn, or simply w'(X,i), if X ∈ Vt. The symbol w'(X,i) is
essentially the action symbol associated with the ith
reference to X.

Two derived symbols W(X,i), W(Y,j) are called compatible
if X = Y, and i ≠ j, that is, if they are generated from
different references to the same symbol.

The distinction index of a pair of compatible symbols
W(X,i), W(X,j) is the smallest integer greater than the
lengths of the w in Vt*, such that for some w1,w2,w3 ∈ Vt*,
both w1 W(X,i) w w2 and w1 W(X,j) w w3 are sentential forms
of the Derived Symbol Polish grammar. If no such w exists,
then the distinction index is defined to be zero; if the
lengths of the w are unbounded, then the index is infinity.

Intuitively,  the  distinction  index  of  a  pair  of 
compatible  symbols  W(X,i)  ,  W(X,j)  is  the  number  of  symbols 
required  in  the  lookahead  string  to  enable  the  parser  to 
determine  which  one  of  the  two  references  is  being  used  at  a 
particular  point  in  a  parse.  If  this  index  is  less  than  or 
equal  to  k,  then  we  can  associate  different  semantic  action 
symbol  sequences  with  each;  if  it  is  greater  than  k,  then  we 
can  either  increase  our  lookahead  set  to  the  greater  number 
(if  the  index  is  finite!),  or  simply  demand  that  all 
semantics  associated  with  the  two  compatible  symbols  be 
identical. In the terminology of Lewis and Stearns, we
equate all compatible symbols whose distinction index is
greater than k.

Lewis and Stearns go on to prove that distinction
indices are effectively computable for LR(k) grammars. Once
these indices are computed, if all compatible pairs with
distinction index greater than some n > k are equated, then
the resulting grammar is LR(n).

The algorithm computing the distinction indices is quite
complex. However, when k ≤ 1, the computation of
distinction indices is not necessary to determine the points
where distinct semantic action symbol sequences may be
placed in productions, as a consequence of the following
theorem:

Theorem 4.4.3.1:

The distinction index of a compatible pair of symbols
W(X,i), W(X,j) is zero if and only if items generated from
the ith and jth references to symbol X never simultaneously
appear with X as their transition symbol in the same item
set in the LR(0) collection constructed from the original
grammar.

Proof:

a) If no w ∈ Vt* exists such that w1 W(X,i) w w2 and
w1 W(X,j) w w3 are both sentential forms, for some
w1,w2,w3 ∈ Vt*, then no w1 ∈ Vt* exists for which both
w1 W(X,i) and w1 W(X,j) are prefixes of sentential forms: if
this w1 did exist, then we could set w2 = w3, and find some
w such that both w1 W(X,i) w w2 and w1 W(X,j) w w3 would be
sentential forms, contradicting our assumption. But the
requirement for the ith and jth references to X to be
transition symbols in the same item set is that some w1
exist; we must have prefix z ∈ V* such that z =>* w1, with
zW(X,i) and zW(X,j) prefixes of some sentential form. Since
no w1 exists in the necessary form, the ith and jth
references to X are never transition symbols in the same
item set.

b) Suppose the ith and jth references to symbol X are
never transition symbols in the same item set. Then there
do not exist z ∈ V* and r ∈ Vt* such that zW(X,i)r and
zW(X,j)r are sentential forms. Thus there is no w as used
in the definition of distinction indices, and the
distinction index of W(X,i) and W(X,j) is zero.

Q. E. D.


Corollary 4.4.3.2:

If two compatible symbols W(X,i), W(X,j) are such that
the ith and jth references to the symbol X are transition
symbols in the same item set of the LR(0) collection
constructed from the original grammar, then the distinction
index of the symbols W(X,i), W(X,j) in the derived symbol
Polish grammar is greater than zero.

Q. E. D.


Thus  we  avoid  semantic  action  conflicts  between  distinct 
references  to  the  same  symbol  by  demanding  that  sequences  of 
semantic  action  symbols  associated  with  two  different 
references  to  the  same  grammar  symbol  must  be  identical  if 
the two references ever exist as transition symbols in the
same item set.

Theorem 4.4.3.3:


Our  constructor  can  correctly  process  any  LL(1)  grammar 
with  distinct  semantic  action  symbols  placed  at  arbitrary 
points  of  productions. 

Proof:

We have already shown that our system processes those
LALR(1) grammars with semantic actions associated at
arbitrary points of productions, as long as distinct
references to the same symbol which appear as transition
symbols in the same item set have identical semantic actions
associated with them. To prove the theorem, we first must
show that LL(1) grammars are also LALR(1), and that in LL(1)
grammars, it is never the case that two references to the
same symbol appear as transition symbols in the same item
set. Both of these conditions are easy to show. First, it
is well known that LL(1) grammars are a subset of LALR(1)
([Aho and Ullman 1973a], as well as Theorem 4.3.5.4).
Second, part (iii) of the proof of Lemma 4.3.5.1 shows that
each item set has exactly one item for each transition
symbol. The only thing remaining to be shown is that we can
process semantic action symbols that are not associated with
other grammar symbols. This class of semantic actions
arises from only one source: the completion of productions.
For this case, we note that we associate the semantic
actions with the lookahead set, that is, those strings that
may validly follow the nonterminal symbol involved in the
reduction, in this item set. Since LL(1) grammars are a
subset of LALR(1), we know that this lookahead set is
disjoint from any other lookahead sets and from the set of
transition symbols in the item set. Therefore there can be
no semantic action conflict.

Q. E. D.


Hence we have determined that we can process all LL(1)
grammars. We demonstrated earlier that not all LALR(1)
grammars are processable when distinct semantic actions are
embedded at arbitrary points of productions: we might
uncover semantic action conflicts when distinct semantics
are associated with two different references to the same
symbol.

It is worth noting that the Derived Symbol Polish
grammars constructed from LR(k) grammars effectively
right-associate semantic action sequences with symbols, by
associating the action sequences with the completion of the
new productions, and allowing only one symbol per new
production. Thus one can not associate any semantic action
to the left of the first symbol of any production. Our
method provides this opportunity, for each production except
those involving left recursion (i.e., of the form A -> Az,
for A ∈ Vn, and z ∈ V*). This restriction is due to the
effects of propagation.

4.4.4: Summary and Examples of Association and Propagation

We  allow  the  association  of  semantic  action  symbol 
sequences  with  arbitrary  points  in  a  production,  as  long  as 
identical  semantic  action  symbols  are  associated  with  two 
distinct  references  to  the  same  symbol  which  appear  as 
transition  symbols  in  the  same  item  set.  Unfortunately,  the 
user  cannot  easily  determine  where  distinct  semantic  action 
symbols  may  be  placed  from  inspection  of  the  grammar. 
Therefore,  one  of  the  system  options  is  a  listing  of  the 
grammar,  with  points  marked  where  semantic  actions  must  be 
identical.  This  listing  costs  nothing  extra,  since  it  is 
output  during  the  computation  of  the  LR(0)  collection  of 
item  sets,  and  the  information  must  be  checked  anyway.  We 
next  present  two  examples  which  embed  semantic  action 
symbols  in  grammars.  The  first  is  a  simple  example  taken 
from  [Lewis  and  Stearns  1968],  which  demonstrates  the  method 
of  determining  where  distinct  semantic  actions  may  be 
placed.  We  have  modified  the  names  of  the  terminal  symbols, 
in  order  to  increase  readability.  The  grammar  is: 

(1)  S  ->  aSb 

(2)  S  ->  aSa 

(3)  S  ->  c 

We first place action symbols at each possible place.
We then determine which action symbols must be identical to
retain the LR(0) property for this grammar.

(1) S -> 1 a 2 S 3 b 4

(2) S -> 5 a 6 S 7 a 8

(3) S -> 9 c 10

We next begin to compute the LR(0) collection of the
original grammar:

0:  S' -> •S
    S -> •aSb
    S -> •aSa
    S -> •c

Since items [S -> •aSb] and [S -> •aSa] have the same
transition symbol in item set 0, we equate actions 1 and 5.

1:  S' -> S•

2:  S -> a•Sb
    S -> a•Sa

Similarly, since items [S -> a•Sb] and [S -> a•Sa] have
the same transition symbol in item set 2, we equate actions
associated with them, namely actions 2 and 6. Continuing in
this way, we arrive at the general schema

(1) S -> 1 a 2 S 3 b 4

(2) S -> 1 a 2 S 7 a 8

(3) S -> 9 c 10

We may derive the schema in the normal SDTS notation by
setting w1 = 1 2, w2 = 3 4, w3 = 7 8, and w4 = 9 10; we then
obtain

(1) S -> aSb {w1 S w2}

(2) S -> aSa {w1 S w3}

(3) S -> c {w4}

which  is  identical  to  that  obtained  in  [Lewis  and  Stearns 
1968],  modulo  different  terminal  symbols. 
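
The equating step in this example can be reproduced with a short Python sketch. It builds the LR(0) collection for the grammar (augmented with S' -> S), then reports every group of items that share a transition symbol within one item set. The representation of an item as a (production index, dot position) pair and the ACTION_BASE numbering, which maps a dot position to the action number written to its left in the schema above, are assumptions made for illustration only.

```python
from collections import defaultdict

PRODUCTIONS = [
    ("S'", ("S",)),
    ("S", ("a", "S", "b")),
    ("S", ("a", "S", "a")),
    ("S", ("c",)),
]
NONTERMINALS = {"S'", "S"}


def closure(items):
    """LR(0) closure of a set of (production, dot) items."""
    items = set(items)
    todo = list(items)
    while todo:
        p, dot = todo.pop()
        rhs = PRODUCTIONS[p][1]
        if dot < len(rhs) and rhs[dot] in NONTERMINALS:
            for q, (lhs, _) in enumerate(PRODUCTIONS):
                if lhs == rhs[dot] and (q, 0) not in items:
                    items.add((q, 0))
                    todo.append((q, 0))
    return frozenset(items)


def transitions(item_set):
    """Group the items of one item set by their transition symbol."""
    by_symbol = defaultdict(set)
    for p, dot in item_set:
        rhs = PRODUCTIONS[p][1]
        if dot < len(rhs):
            by_symbol[rhs[dot]].add((p, dot))
    return by_symbol


def lr0_collection():
    start = closure({(0, 0)})
    seen, todo = [start], [start]
    while todo:
        current = todo.pop()
        for items in transitions(current).values():
            nxt = closure({(p, dot + 1) for p, dot in items})
            if nxt not in seen:
                seen.append(nxt)
                todo.append(nxt)
    return seen


def references_to_equate():
    """Item groups sharing a transition symbol in some item set."""
    groups = set()
    for item_set in lr0_collection():
        for items in transitions(item_set).values():
            if len(items) > 1:
                groups.add(frozenset(items))
    return groups


# Number of the action written left of position 0 in each production.
ACTION_BASE = {1: 1, 2: 5, 3: 9}

equated = {
    frozenset(ACTION_BASE[p] + dot for p, dot in group)
    for group in references_to_equate()
}
```

For this grammar the sketch yields the two groups {1, 5} and {2, 6}; actions 3 and 7 stay distinct because b and a are different transition symbols.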

We have in effect generated the 'general form' Polish
grammar, and then equated action symbols as necessary to
retain the LR property of the original grammar.

Let us run through a somewhat more complex example,
which demonstrates propagation. Consider the sequence of
productions defining a Boolean selection construct:

(1) if -> 'if' be 'then' A1 A2 'end'

(2) A1 -> s

(3) A1 -> A1 s

(4) A2 -> 'else' A1

(5) A2 -> e

where we have denoted a Boolean expression by be, and a
statement by s. For the purposes of this example, we assume
that both be and s are terminal symbols. The 'semantic
schema' of this grammar is:

(1) if -> 1 'if' 2 be 3 'then' 4 A1 5 A2 6 'end' 7

(2) A1 -> 8 s 9

(3) A1 -> 10 A1 11 s 12

(4) A2 -> 13 'else' 14 A1 15

(5) A2 -> 16

We next derive the LR(0) collection of item sets for the
original grammar, with SLR(1) conflict resolution. On the
left, we have indicated the items; on the right, we have
indicated the transition information between the item sets.
We have marked at the far right those item sets
necessitating semantic actions to be equated.

 0:  if -> •'if' be 'then' A1 A2 'end'   |  'if':          to 1
 1:  if -> 'if' •be 'then' A1 A2 'end'   |  be:            to 2
 2:  if -> 'if' be •'then' A1 A2 'end'   |  'then':        to 3
 3:  if -> 'if' be 'then' •A1 A2 'end'   |  A1:            to 4     4 = 10
     A1 -> •A1 s                         |  A1:            to 4
     A1 -> •s                            |  s:             to 5
 4:  if -> 'if' be 'then' A1 •A2 'end'   |  A2:            to 6
     A1 -> A1 •s                         |  s:             to 7
     A2 -> •'else' A1                    |  'else':        to 8
     A2 -> •                             |  'end':         R5
 5:  A1 -> s•                            |                 R2
 6:  if -> 'if' be 'then' A1 A2 •'end'   |  'end':         to 9
 7:  A1 -> A1 s•                         |                 R3
 8:  A2 -> 'else' •A1                    |  A1:            to 10    14 = 10
     A1 -> •A1 s                         |  A1:            to 10
     A1 -> •s                            |  s:             to 5
 9:  if -> 'if' be 'then' A1 A2 'end'•   |                 R1
10:  A2 -> 'else' A1•                    |  'else','end':  R4
     A1 -> A1 •s                         |  s:             to 7

Thus, it appears that any directly left recursive
nonterminal symbol causes all action symbols positioned
immediately to the left of any of its references to be
equated in the LR(0) collection computation process. This
is not always desirable, as we can easily demonstrate by
assigning particular values to the actions in this example:

let 1=2=7=8=10=11=13=16=null;
let 4=14=#indent;
let 3=#indent #new_line;
let 5=6=15=#exdent; and
let 9=12=#new_line.


Then the grammar becomes:

(1) if -> 'if' be #indent #new_line 'then' #indent A1
          #exdent A2 #exdent 'end'

(2) A1 -> s #new_line

(3) A1 -> A1 s #new_line

(4) A2 -> 'else' #indent A1 #exdent

(5) A2 -> e

These productions define the Boolean selection
construct, with optional else-clause, paragraphed as
follows:

if <be>
    then <stmt>
         <optional stmts>
    else <stmt>
         <optional stmts>
end
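
To see how the action symbols produce this paragraphing, the following Python sketch interprets a stream of terminals and action symbols. The interpretation of the three actions is our assumption, not a description of the actual prettyprinter: #indent and #exdent raise and lower an indentation level, #new_line ends the current output line, and a line takes the level in force when its first terminal arrives. The function name render and the token names s1, s2, s3 are hypothetical.

```python
def render(stream, indent_width=4):
    """Lay out terminals according to the embedded action symbols."""
    lines = []
    current = []     # terminals on the line being built
    level = 0        # current indentation level
    line_level = 0   # level captured when the line received its first token
    for symbol in stream:
        if symbol == "#indent":
            level += 1
        elif symbol == "#exdent":
            level -= 1
        elif symbol == "#new_line":
            pad = " " * (indent_width * line_level)
            lines.append(pad + " ".join(current) if current else "")
            current = []
        else:
            if not current:
                line_level = level
            current.append(symbol)
    if current:
        lines.append(" " * (indent_width * line_level) + " ".join(current))
    return "\n".join(lines)


# Terminals and actions in the order the productions above yield them
# for the input: if be then s1 s2 else s3 end
stream = [
    "if", "be", "#indent", "#new_line", "then", "#indent",
    "s1", "#new_line", "s2", "#new_line", "#exdent",
    "else", "#indent", "s3", "#new_line", "#exdent",
    "#exdent", "end",
]
layout = render(stream)
```

Under these assumptions the first statement of each clause shares a line with 'then' or 'else', later statements are indented one level further, and 'end' returns to the level of the 'if'.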

We have action 4 ≠ action 10. Moreover, setting action
10 to #indent causes many extra indentations, one per
recognized statement. What we would prefer to do is to
perform the action the one time, but not more than that.
Notice that this problem is not due to the left association
we have been using; if actions were right associated, then
the same problem would occur with actions 5 and 11,
where 5=#exdent, and 11=null. We cure the ill by
propagating action symbols originally to the left of
nonterminal symbols, to the left of items formed from the
defining productions. In our example, item sets 3, 4, and 8
would be affected. In item set 3, any semantic actions
associated with the transition symbol A1 of the first item
would be propagated to the left of the items based on the
definition of A1. We have effectively added during closure
the item [A1 -> •s] based on the production A1 -> 4 8 s,
rather than A1 -> 8 s. Item set 8 is similar: the item
[A1 -> •s] is based on the production A1 -> 14 8 s. In
both cases, since the production A1 -> s is recognized at
most once from the item sets (in item set 5), the actions
are performed only once per statement list. The general
schema for this sequence of productions, taking propagation
into account, is:

(1) if -> 1 'if' 2 be 3 'then' 4 A1 5 A2 6 'end' 7

(2) A1 -> 8 s 9

(3) A1 -> A1 11 s 12

(4) A2 -> 13 'else' 14 A1 15

(5) A2 -> 16

Action 4 need not be identical to action 14.

Therefore,  in  general,  a  directly  left  recursive 
production  may  not  have  semantic  actions  left-associated 
with  the  first  symbol  of  the  production. 

This example has demonstrated the usefulness of
propagation, in that it allows us to implement semantic
action association to only terminal symbols, which is quite
easy.


Without the optional else-clause, the same productions
paragraph the construct as:

if <be>
    then <stmt>
         <optional stmts>
end


4.5: Table Optimization

The problem of optimizing LR parse tables has received
the most attention in the literature, simply because even
the LALR(1) method creates large tables. For example,
Anderson reports that the item set collection produced by
his SLR(1) method for the ALGOLW language contained 330 item
sets, and about 6100 items [Anderson, Eve and Horning 1973].
Purdom states that the number of item sets in the LR(0)
collection can be estimated quite accurately (to within a
few item sets) by simply using the sum of the number of
grammar symbols on the right hand side of all productions,
plus the number of productions [Purdom 1974].

There are several very worthwhile optimizations that can
be made. These include:

1) sparse matrix representations;

2) LR(0) reduce table elimination;

3) single production elimination; and

4) compatible table merger.

All  of  the  optimizations  we  will  describe  maintain  the 
immediate  error  detection  property  of  LR  parse  tables;  the 
optimizations  that  impair  this  property  (such  as  general 
matrix  minimization  techniques)  are  not  considered  here. 


4.5.1:  Sparse  Matrix  Representations 

Under this heading, we include the techniques of
combining the f and g functions for terminal symbols, and
using a list representation of the tables.

First, we combine the f and g functions for terminal
symbols, by noticing that f(X) = shift if and only if
g(X) ≠ error and X ∈ Vt [Anderson, Eve and Horning 1973],
[Aho and Johnson 1974]. We simply combine the values
returned by the two functions into one ordered pair. For
example, the transformed entries for table 7 of Figure
4.3.1d would look like the following:

                 f            ||              g
      a   +   *   (   )   e   ||   E   T   F    a   +   *   (   )
 7    S5  x   x   S4  x   x   ||   x   x   10   x   x   x   x   x

where we have placed an 'x' in the g-values we moved over to
the f-function.
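
The combination step can be sketched as follows. This is an illustrative Python fragment rather than the constructor's own code; the dictionary encoding of a table row, the terminal set, and the name `combine` are assumptions. Entries that are absent from a row stand for error.

```python
# Terminal symbols of the example grammar ('e' is the end marker).
TERMINALS = {"a", "+", "*", "(", ")", "e"}


def combine(f_row, g_row, terminals=TERMINALS):
    """Fold the g-values for terminal symbols into the f-function.

    f_row maps terminal -> action name; g_row maps symbol -> next table.
    Returns (combined_f, remaining_g): every shift becomes the ordered
    pair ("shift", next_table), and remaining_g keeps only the entries
    in columns headed by nonterminal symbols.
    """
    combined_f = {}
    for symbol, action in f_row.items():
        if action == "shift":
            combined_f[symbol] = ("shift", g_row[symbol])
        else:
            combined_f[symbol] = action
    remaining_g = {s: t for s, t in g_row.items() if s not in terminals}
    return combined_f, remaining_g


# Table 7 of the running example: shift on 'a' and '(' only.
f7 = {"a": "shift", "(": "shift"}
g7 = {"F": 10, "a": 5, "(": 4}
combined_f7, remaining_g7 = combine(f7, g7)
```

For table 7 this yields the pairs written S5 and S4 above, and leaves the g-function with the single entry for F.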

This  transformation  saves  no  space  in  itself;  rather,  it 
has  paved  the  way  for  another  transformation.  We  have  made 
the  g-functions  even  more  sparse  than  they  were,  without 
making the f-functions more dense. At this point, the g-functions
have non-error entries only in columns headed by
nonterminal  symbols.  Now  we  can  represent  the  g-functions 
by  a  small  number  of  lists,  encoding  the  matrix  by  columns. 
For  example,  the  list  for  the  column  headed  by  the 
nonterminal symbol F of Figure 4.3.1d would have the
following entries:

F:  0:3
    4:3
    6:3
    7:10

where  the  first  number  of  each  pair  is  the  table  number 
containing  the  g-value,  which  is  the  second  number  of  the 
pair. 


We  can  encode  the  f-functions  in  the  same  manner,  but  by 
rows  rather  than  by  columns.  We  can  also  introduce  a 
default  action,  which  is  an  entry  taken  when  all  others  fail 
to  match  the  input  symbol.  The  f-function  list  for  table  7 
of Figure 4.3.1d becomes:

7:  a:    S5
    (:    S4
    def:  error

We  also  apply  the  default  idea  to  the  column  lists  of 
the g-function; the list for F of Figure 4.3.1d becomes

F:  7:    10
    def:  3

Thus  we  are  starting  to  save  considerable  space,  at  some 
cost  in  access  time. 

We  normally  choose  the  default  entry  to  be  the  most 
commonly  occurring  entry  in  the  list,  to  reduce  the  size  of 
the  list  as  much  as  possible.  We  uniformly  make  the  default 
entry  the  last  one  on  the  list. 
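
The list encoding with a default entry can be sketched in a few lines. The Python below is only an illustration under our own assumptions (the names `encode_with_default` and `lookup`, and the use of the key "def" for the default); it is not the representation actually emitted by the constructor.

```python
from collections import Counter


def encode_with_default(entries):
    """Encode a row or column as a list of (key, value) pairs.

    The most commonly occurring value becomes the default, and the
    default entry is always placed last on the list.
    """
    default, _ = Counter(entries.values()).most_common(1)[0]
    explicit = [(key, value) for key, value in entries.items() if value != default]
    return explicit + [("def", default)]


def lookup(encoded, key):
    """Scan the list; the trailing default matches when all others fail."""
    for entry_key, value in encoded:
        if entry_key == key or entry_key == "def":
            return value


# Column for the nonterminal F: table number -> g-value.
f_column = encode_with_default({0: 3, 4: 3, 6: 3, 7: 10})

# Row for table 7: lookahead symbol -> f-value.
row_7 = encode_with_default(
    {"a": "S5", "+": "error", "*": "error", "(": "S4", ")": "error", "e": "error"}
)
```

The linear scan in `lookup` is where the cost in access time arises: the default is found only after every explicit entry has failed to match.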

We  can  save  even  more  space  by  using  our  knowledge  of 
how  the  tables  are  processed.  Consider  a  table  that  has 
both error and reduce entries. We can replace the error
entries  with  one  of  the  reduce  entries.  The  parser  using 
these  tables  will  then  make  a  series  of  reductions,  where  a 
parser  using  the  original  tables  will  announce  error,  when 
both  are  parsing  some  erroneous  input  string.  But  both 
parsers  announce  error  before  shifting  the  next  input  symbol 
[Aho and Johnson 1974]. It may well be the case that we
want  to  perform  these  reductions,  to  ease  error  recovery. 


4.5.2:  Elimination  of  LR(0)  Reduce  Tables 

f-functions such as those in tables 3, 5, 10, and 11 of
Figure 4.3.1d have only one non-error value for all input
symbols: reduce. We may replace each reference to a table
containing this LR(0) reduce action with a new action,
called shift-reduce. This new action combines the two
actions, and eliminates all references to the tables
containing the LR(0) reduce actions. We may then eliminate
these tables [Anderson, Eve and Horning 1973], [Horning
1974].
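
A minimal sketch of this replacement, for the f-functions only, is given below in Python. The table encoding, the key "def", and the name `eliminate_lr0_reduce_tables` are assumptions made for illustration; references to these tables from the g-functions would need the analogous treatment, which is omitted here.

```python
def eliminate_lr0_reduce_tables(f, lr0_reduce):
    """Replace shifts into LR(0) reduce tables by shift-reduce actions.

    f           maps table number -> {symbol: (kind, target)}, where kind
                is "shift" (target is a table) or "reduce" (target is a
                production).
    lr0_reduce  maps each LR(0) reduce table number -> its production.
    """
    result = {}
    for table, row in f.items():
        if table in lr0_reduce:
            continue  # the table itself is no longer referenced
        new_row = {}
        for symbol, (kind, target) in row.items():
            if kind == "shift" and target in lr0_reduce:
                new_row[symbol] = ("shift-reduce", lr0_reduce[target])
            else:
                new_row[symbol] = (kind, target)
        result[table] = new_row
    return result


# Table 7 shifts 'a' into table 5, whose only action is to reduce by F -> a.
f = {
    7: {"a": ("shift", 5), "(": ("shift", 4)},
    5: {"def": ("reduce", "F -> a")},
}
new_f = eliminate_lr0_reduce_tables(f, {5: "F -> a"})
```

After the transformation table 5 has disappeared, and table 7 performs the shift and the reduction by F -> a in one step.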


4.5.3: Single Production Elimination

It  is  often  the  case  that  programming  language  grammars 
contain productions of the form A -> B, where both A and B
are nonterminal symbols. Usually, there are no semantics
associated  with  this  type  of  production,  so  that  the  actions 
involved  in  recognizing  B  and  reducing  it  to  A  are  not 
really  necessary. 

A  quick  glance  at  nearly  any  programming  language 
grammar  will  show  that  these  productions  are  frequently  used 
in  the  parts  of  the  grammar  defining  expressions,  usually  in 
the  description  of  precedence  levels  of  operators.  Consider 
our  arithmetic  expression  grammar  G: 

    E -> E+T
       | T

    T -> T*F
       | F

    F -> (E)
       | a

We have 2 single productions, E -> T and T -> F. In
the parsing of the input string 'a+a', the parser must
sequentially perform the reductions by F -> a, T -> F, and
E -> T on the first 'a', and the reductions by F -> a and
T -> F on the second 'a', before it can perform the
reduction by E -> E+T (see Appendix F for an example).
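
Identifying the single productions of a grammar is straightforward. The sketch below, in Python, uses an encoding of G and a function name (`single_productions`) that are our own assumptions; it does not check whether a production carries semantic actions.

```python
NONTERMINALS = {"E", "T", "F"}

# The arithmetic expression grammar G, as (lhs, rhs) pairs.
G = [
    ("E", ["E", "+", "T"]),
    ("E", ["T"]),
    ("T", ["T", "*", "F"]),
    ("T", ["F"]),
    ("F", ["(", "E", ")"]),
    ("F", ["a"]),
]


def single_productions(productions, nonterminals):
    """Return the productions A -> B in which B is a single nonterminal."""
    return [
        (lhs, rhs[0])
        for lhs, rhs in productions
        if len(rhs) == 1 and rhs[0] in nonterminals
    ]


singles = single_productions(G, NONTERMINALS)
```

Note that F -> a is not reported: its right hand side has one symbol, but that symbol is a terminal.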

Clearly,  eliminating  all  reductions  by  single 
productions  possessing  no  semantic  significance  will  not 
affect  the  parser  output.  The  advantages  to  be  gained  by 
single  production  elimination  are  twofold:  the  parser  is 
faster,  and  is  often  smaller. 

That  this  optimization  is  worthwhile  can  be  demonstrated 
by  a  look  at  a  real  programming  language  grammar.  For 
example, the SUE.8 grammar refers to the nonterminal symbol
expression some 19 times in the definitions of 34 of the 42
nonterminal symbols in the language. The remaining 8
nonterminal symbols are involved in the definition of
expression.

Anderson  discusses  a  technique  of  eliminating  single 
productions  which  rids  the  tables  of  all  of  them,  but  the 
technique  may  increase  the  size  of  the  tables.  Aho  and 
Ullman [Aho and Ullman 1973b] discuss a slightly different
technique  which  does  not  increase  the  size  of  the  tables, 
and  yet  eliminates  many  of  the  single  productions.  In 
general,  their  technique  does  not  eliminate  all  of  the 
single  productions.  Pager  [Pager  1973b]  describes  a 
technique which eliminates all the single productions, and
decreases  the  size  of  the  tables.  His  algorithm  is  fairly 
complex.  All  of  the  above  methods  first  create  some 
representation  similar  to  the  tables  we  have  described,  and 
then attempt to eliminate the single productions. LaLonde
[LaLonde 1976] describes a method of directly constructing
tables  without  single  productions.  His  methods  involve  a 
decremental  approach  to  table  construction. 


4.5.4:  Compatible  Table  Merger 

The  basic  idea  of  this  optimization  is  that  quite  often 
several  tables  have  identical,  or  at  least  compatible,  f- 
entries,  and  therefore  can  be  combined. 

We  must  be  careful  of  how  we  define  compatible,  however; 
if  we  define  it  quite  generally,  as  in 

    f1 and f2 are compatible if either f1(X) = f2(X), or one
    of f1(X) and f2(X) is error, for all lookahead symbols X,

then  we  lose  our  immediate  error  detection  property.  If  we 
define  compatibility  very  narrowly,  as  in 

    f1 and f2 are compatible only if f1(X) = f2(X), for all
    lookahead symbols X,

then  we  can  only  combine  tables  that  have  identical  f- 
functions;  these  may  be  few  and  far  between. 

Aho and Ullman [Aho and Ullman 1972b] give an
interesting technique for merging tables. They first
discuss don't-care entries in the tables. It can be shown
that  some  of  the  error  entries  in  the  parsing  action 
functions  will  never  be  referenced,  even  when  parsing  an 
erroneous  input  string.  These  positions  can  be  overlaid  as 
necessary  to  compact  the  tables,  with  no  loss  of  immediate 
error detection. Aho and Ullman give algorithms for
determining the don't-care entries, for several different
table  construction  methods.  They  then  give  an  algorithm  for 
merging  compatible  tables.  Their  definition  of  compatible 
is  essentially  the  following: 

    f1 and f2 are compatible if either f1(X) = f2(X), or one
    of f1(X) and f2(X) is don't-care, for all lookahead
    symbols X.
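
The three definitions of compatible differ only in which entry, if any, is allowed to overlay any other. The Python sketch below makes this explicit; the name `compatible`, the `wildcard` parameter, and the string values "error" and "dont-care" are assumptions for illustration, and the sample f-functions are invented rather than taken from any figure.

```python
def compatible(f1, f2, symbols, wildcard=None):
    """Test two f-functions for compatibility over all lookahead symbols.

    wildcard=None         the very narrow definition (identical entries);
    wildcard="error"      the general definition (loses immediate error
                          detection);
    wildcard="dont-care"  essentially the don't-care definition.
    Symbols missing from a function are taken to be error entries.
    """
    for x in symbols:
        a = f1.get(x, "error")
        b = f2.get(x, "error")
        if a == b:
            continue
        if wildcard is not None and wildcard in (a, b):
            continue
        return False
    return True


SYMBOLS = ["a", "+", "*", "(", ")", "e"]
fa = {"a": "S5", "(": "S4"}
fb = {"a": "S5", "(": "S4", "+": "dont-care"}
fc = {"a": "S5", "(": "S4", "+": "S6"}
```

Here `fa` and `fb` are compatible only under the don't-care definition, while `fa` and `fc` are compatible only under the general definition, which is exactly the case that sacrifices immediate error detection.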

Both  Anderson  and  LaLonde  [LaLonde  1971]  describe  the 
list  representation  form  of  compatible  table  merger,  with 
the  very  narrow  definition  of  compatible. 

Pager  [Pager  1973a]  introduces  two  new  actions  into  his 
list  representation  form  of  the  tables.  One  is  essentially 
a goto, so that the list for one table can join the list for
another table at an arbitrary point in the latter list.
This allows otherwise incompatible tables to be merged on
their compatible subsets.

All of the above methods suffer from the complexity of
the  algorithms  determining  the  compatible  tables.  LaLonde's 
method,  which  is  the  simplest  of  the  above  methods, 
typically  reduces  the  size  of  the  lists  by  a  factor  of  two 
[Horning  1974a].  The  others  do  at  least  as  well. 


4.5.5:  Other  Methods 

Joliat [Joliat 1973] describes an interesting approach
to  LR  table  optimization.  He  keeps  the  matrix  form  of  the 
tables,  rather  than  using  some  sparse  matrix  representation. 
He  then  factors  the  one  matrix  into  several  special-purpose 
matrices,  and  uses  general  matrix  reduction  techniques  on 
each  matrix,  independently  of  all  the  others.  The  parsing 
system using the tables produced by Joliat's constructor
must  perform  a  separate  lookup  in  several  matrices,  to 
determine  whether  or  not  the  input  symbol  is  legal,  what 
parsing  action  to  perform,  etc.;  but  each  reference  is  very 
fast.  Horning  states  that  the  LR  parsers  constructed  by 
Joliat's techniques are "...probably about as small and fast
as  any  available  table-driven  parsers"  [Horning  1974a]. 

LaLonde  [LaLonde  1971]  describes  several  other 
optimizations  which  further  reduce  the  size  of  the  lists  his 
system  produces.  He  sorts  the  tables,  based  on  which  kind 
they are (shift or reduce), and maintains separate pointers
to  the  boundaries  of  each  type.  He  splits  tables  that 
contain  both  shift  and  reduce  entries,  so  that  the  division 
of  the  two  types  of  tables  is  effective.  Then  the  actual 
action  in  each  table  entry  need  not  be  coded:  we  merely  code 
the  next  table  to  use  for  shift  actions,  or  the  production 
number  for  reduce  actions.  The  action  to  perform  is  deduced 
from  the  table  index. 
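
The idea of deducing the action from the table index can be sketched as follows. This Python fragment is only our reading of the scheme, not LaLonde's code: we assume that shift tables occupy the lower indices and that a single boundary, here called `first_reduce_table`, separates the two kinds.

```python
def decode(entry, table_index, first_reduce_table):
    """Recover the full action from an uncoded table entry.

    Tables numbered below first_reduce_table hold only shift entries, so
    the entry is the next table; all other tables hold only reduce
    entries, so the entry is a production number.
    """
    if table_index < first_reduce_table:
        return ("shift", entry)
    return ("reduce", entry)


# With tables 0..7 holding shifts and tables 8 onward holding reductions:
shift_action = decode(5, table_index=7, first_reduce_table=8)
reduce_action = decode(6, table_index=9, first_reduce_table=8)
```

The saving comes from storing one number per entry instead of an action code and a number.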


4.6:  Constructor  Output 

In  this  section,  we  will  describe  all  the  information 
produced  by  the  constructor  system.  We  will  first  describe 
the  information  listed  by  the  constructor  for  the  user.  We 
will  then  discuss  the  information  produced  for  the  various 
parts  of  the  parsing  system. 

4.6.1:  To  the  User 

Information  is  output  from  each  major  section  of  the 
constructor. Nearly all listings are optional, in the sense
that the user may, by setting toggles or selecting the
appropriate option, choose just the output he desires for
a particular run.

4.6.1.1: Grammatical Input and Transformation

Output  provided  by  this  phase  of  the  constructor  is  an 
echoing  of  the  information  provided  as  input  by  the  user. 
This  includes  listings  of  the  symbol  declarations,  the 
toggle settings, and the input grammar. A listing of the
current  options  in  effect  is  the  only  output  of  this  phase 
which  is  not  optional.  All  the  listings  are  paragraphed. 

4.6.1.2: Grammatical Analysis

This  phase  provides  much  information  for  the  user,  in 
the  form  of  a  cross  reference  listing,  and  a  listing  of  the 
grammar  after  transformation.  This  listing  is  in  a  depth- 
first  canonical  order,  as  illustrated  in  Appendix  E. 

The  cross  reference  map  consists  of  an  alphabetical 
listing  of  each  declared  symbol,  and  a  sequence  of  indices 
of  productions  in  which  the  symbol  appears.  These  indices 
refer  to  the  canonically-ordered  listing,  if  it  was  output; 
otherwise,  they  refer  to  the  input  order,  which  was  used  in 
the  grammar  listing  of  the  input  phase. 

The  constructor  can  also  output  the  FIRST  and  FOLLOW 
sets  of  each  nonterminal  symbol,  as  well  as  the  set  of 
nonterminal  symbols  generating  the  null  string. 

The  grammatical  checks  performed  by  this  phase  will  also 
generate  some  output.  If  the  grammar  is  ambiguous,  due  to 
one  or  more  nonterminals  being  left  and  right  recursive,  an 
error  message  is  output,  and  the  constructor  halts.  Any 
symbols  and  productions  not  used  in  generating  sentences  in 
the  language  are  identified. 

4.6.1.3: Construction of the LR(0) Collection

Several options are provided here. The LR(0) collection
of  item  sets  may  be  output  as  it  is  being  constructed.  For 
each  item  set,  the  basis  set  is  listed,  along  with  the 
transition  symbols  for  intermediate  items  and  the  required 
lookahead  sets,  if  any,  for  final  items.  Item  sets  with 
conflicts  are  flagged. 

As  was  mentioned  in  the  final  paragraph  of  Section  4.4, 
this  phase  also  produces  another  listing  of  the  grammar, 
this  time  with  positions  marked  where  distinct  semantic 
action  symbols  may  be  placed.  We  are  considering  the 
additional  output  of  a  small  reference  map  listing  the 
semantic  action  symbols  and  the  positions  in  which  they  may 
appear  in  this  schema  grammar. 

4.6.1.4: Optimization

This  section  simply  reports  on  the  amount  of  space  saved 
by  various  optimizations. 

4.6.1.5: Output

While  the  constructor  is  preparing  output  for  the 
parsing  system,  it  presents  summaries  of  this  information  to 
the  user.  We  will  discuss  this  information  in  Sections 
4.6.2  and  4.6.3. 

4.6.1.6: Table Usage Statistics

The  final  output  of  the  system  is  a  listing  of  the 
number  of  entries  used  in  the  various  dictionaries  and 
tables  throughout  the  constructor. 


4.6.2:  To  the  Screener 

The  constructor,  assuming  some  ordering  of  language 
tokens,  prepares  a  table  that  maps  source  language  tokens 
into  parse  tokens.  This  allows  users  to  employ  their  own 
ordering  of  language  tokens,  or  not,  as  they  wish,  without 
affecting  the  parser.  If  the  ordering  of  language  tokens  is 
modified,  the  user  must  modify  the  mapping  table 
accordingly.  This  table  is  also  listed  for  the  user. 

The  screener  examines  each  identifier  extracted  from  the 
input  by  the  scanner,  to  determine  if  the  identifier  is  a 
keyword  of  the  language.  This  determination  is  performed  by 
searching  a  keyword  table.  The  keyword  table  is  initialized 
by  a  series  of  invocations  of  a  SUE  macro.  This  series  of 
statements  is  provided  to  the  screener  by  the  constructor. 
Thus  the  constructor  supplies  the  screener  with  the 
keywords,  but  the  screener  actually  defines  the  method  used 
to  fill  and  search  the  keyword  table.  The  constructor  knows 
nothing  of  these  methods,  which  is  as  it  should  be.  Users 
of  the  constructor  are  then  free  to  define  their  own 
methods,  and  modify  the  screener,  as  was  discussed  in 
Section  3.2. 
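
The division of labour between constructor and screener can be illustrated with a small sketch. The Python below is not the SUE macro mechanism described above; the keyword list, the token names, and the functions `fill_keyword_table` and `screen` are all hypothetical, and a dictionary stands in for whatever search method a screener might choose.

```python
def fill_keyword_table(keywords):
    """Build the keyword table from the keywords supplied by the constructor.

    The screener alone decides how the table is filled and searched; here
    it is a dictionary mapping each keyword to a parse token name.
    """
    return {word: word.upper() for word in keywords}


def screen(identifier, keyword_table):
    """Classify an identifier extracted from the input by the scanner."""
    return keyword_table.get(identifier, "IDENTIFIER")


# Hypothetical keywords, standing in for the constructor-supplied list.
table = fill_keyword_table(["if", "then", "end"])
```

Replacing the dictionary by a sorted array or a hash table of the user's own design changes only these two functions, which is the independence the text describes.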


4.6.3:  To  the  Parser  and  Parse  Output  Processor 

The  constructor  provides  the  parser  with  the  parse 
tables,  and  also  provides  a  table  of  semantic  action  codes 
to  both  the  parser  and  the  parse  output  processor.  In  order 
to  provide  the  parse  tables  to  the  parser,  the  constructor 
produces  a  sequence  of  statements  which  becomes  the  body  of 
a  macro.  This  macro  is  defined  in  one  of  the  first  context 
blocks  of  the  parsing  system,  in  order  to  ease  the  insertion 
of  new  tables. 


Chapter 5

Summary  and  Conclusions 


5.1:  Summary  and  Further  Questions 

In  this  thesis  we  have  developed  the  theory  of  LR 
parsing  and  parse  table  construction,  using  Aho  and  Ullman’s 
model.  We  have  described  the  design  of  an  efficient 
implementation  of  the  parsing  and  construction  processes, 
with  extensive  motivation  for  the  methods  chosen.  We  have 
also  described  methods  allowing  the  inclusion  of  several 
desirable  features  not  found  in  current  constructors,  such 
as  semantic  actions  at  points  in  a  production  other  than  the 
completion  of  the  production,  and  allowing  regular  right 
parts  in  productions.  Along  the  way,  we  have  proven  several 
theorems  of  interest,  in  particular  that  our  constructor 
(and all LALR(1) constructors) will handle any LL(1)
grammar. 

We  may  intuitively  view  the  parsing  process  as  a 
sequence  of  hypothesis  formulations:  Each  hypothesis  states 
that  a  particular  production  was  used  to  produce  the  part  of 
the  input  string  currently  being  examined.  At  each  parse 
step,  the  input  string  is  used  to  perform  a  test  of  the 
current  hypotheses,  and  possibly  a  hypothesis  is  confirmed 
or rejected. The LL(1) parsing technique is able to confirm
or  reject  hypotheses  about  which  productions  were  used  to 
produce  a  certain  part  of  the  input  string,  immediately  upon 
input  of  the  first  symbol  of  that  part  of  the  string.  The 
LR  parsing  technique,  on  the  other  hand,  may  reject  a 
hypothesis  at  any  point  of  the  string,  but  can  accept  a 
hypothesis only after examining the end of the substring
produced  by  the  production. 

Semantic  actions  embedded  in  a  production  require  the 
confirmation  of  the  hypothesis  that  this  production  was 
indeed  used  to  produce  the  current  part  of  the  input  string: 
we  must  know  whether  or  not  to  perform  the  actions.  In  an 
LL(1) parser, this hypothesis has already been confirmed or
rejected. 

In  an  LR  parser,  however,  encountering  semantic  actions 
embedded  in  a  production  forces  a  premature  confirmation  or 
rejection  of  the  hypothesis.  We  have  given  one  method  of 
making  this  determination,  which  works  for  a  certain  class 
of  grammars  containing  at  least  the  LL(1)  grammars  (but  not 
all  LALR(1)  grammars).  The  system  described  herein 
determines whether or not the confirmation/rejection
problems can be resolved for a particular grammar, and
constructs  parse  tables  for  those  grammars  for  which  the 
determinations  can  be  made. 

Several interesting questions remain.

1)  Can  we  extend  the  class  of  grammars  that  our 
constructor can process, from LALR(1) to LR(1)? Certainly,
both Adams [Adams and White 1975] and Aho and Ullman [Aho
and Ullman 1973a] give algorithms for splitting item sets.

2) Should we directly construct the LR(0) collection of
item  sets  from  the  regular  right  parts?  At  present  we  first 
transform  the  regular  expressions  into  BNF.  However,  the 
relevant  theory  for  direct  construction  has  recently  been 
developed  in  [LaLonde  1975]. 

3)  What  further  optimizations  can  be  performed  to 
produce  a  minimal  set  of  tables?  Assuming  that  minimal 
tables  are  not  practically  computable,  what  is  a  good  set  of 
general optimizations? Is it possible to produce some sort
of decision tree of possible optimizations, so that
particular  measures  can  be  computed  from  a  grammar,  and  used 
to  walk  the  tree,  to  determine  which  optimizations  to 
perform  on  the  tables  generated  from  that  particular 
grammar?  When  work  first  began  on  this  thesis,  it  was  hoped 
that  the  constructor  would  be  capable  of  producing  for  the 
parser not only the parsing tables, but also the access
method. Essentially, the constructor would determine and
apply a set of optimizations dependent on the particular
grammar which was being processed. When the tables were
output, the definitions of the macros
'Determine_parsing_action', 'Next_table', and
'Determine_table_after_reduction' would also be supplied to
the parser. This parameterizes the method of access to the
tables, as well as the tables themselves.

4)  Even  though  LR  parsers  detect  errors  at  the  earliest 
possible  moment  in  the  input,  recovering  from  the  error  is 
another  story  altogether.  No  general  solution  seems  to 
exist  [Aho  and  Johnson  1974].  The  parser  currently  uses  an 
error  recovery  technique  which  has  proved  its  worth  in  an 
LL(1)  system  at  the  University  of  Toronto  [Barnard  1975], 
but  is  relatively  untested  in  a  bottom-up  environment. 
Recovery consists of three hierarchical levels: tokens
within  lines,  lines,  and  programs.  The  techniques  we  employ 
at  the  first  level  are  due  to  Barnard;  the  other  two  levels 
form a basic 'panic mode' type of error recovery [Graham and
Rhodes  1973],  [Horning  1974b].  For  details,  see  the  user's 
manual  of  the  parsing  system  and  [Barnard  1975]. 

5) In discussing semantic actions, we have only dealt
with the case where the actions are performed upon
successful recognition of some lookahead string. We have
not dealt with the case where strings are not found.
Limited use of this feature could provide error recovery for
special cases. In any case, it is a subject worth
investigating.


5.2: Implementation Results

The  constructor  is  in  the  final  stages  of 
implementation;  hence,  no  statistics  on  its  performance  have 
yet been gathered. All of the techniques discussed in the
thesis  are  implementable,  most  quite  efficiently.  We  should 
mention  two  points  here,  one  in  connection  with  the  language 
chosen  for  the  implementation,  and  the  other  an  interesting 
observation  about  regular  expression  grammars. 

The  language  in  which  the  constructor  and  parsing 
systems are implemented is SUE.8, a compatible subset of the
SUE  System  Language.  SUE  was  not  chosen  just  for  its 
availability,  nor  for  the  extensive  availability  of 
expertise  about  the  compiler/run-time  system,  although  both 
are  very  important  considerations.  Rather,  SUE  was  chosen 
for  the  type  of  programming  it  imposes  on  the  user,  its 
readability,  and  its  extensive  type-  and  range-checking 
facilities.  PL/I  passes  the  availability  considerations, 
but  fails  the  latter  ones. 

However,  the  choice  of  SUE  as  our  implementation 
language  will  restrict  the  portability  of  the  constructor 
system, until the SUE.8 compiler is ready for widespread
dissemination.  The  constructor  system  was  impacted  by  the 
choice  of  implementation  language  in  another  way:  since  SUE 
does  not  allow  run-time-computable  storage  allocation  (all 
storage  allocation  must  be  determinable  by  the  compiler), 
the  constructor  cannot  tailor  its  arrays  to  be  just  the 
right size for each grammar, but must instead use some
compiled-in constant size. When more entries are required,
the constructor must be re-compiled.

In examining the grammar of the SUE.8 language (Appendix
E), we noticed that most of the embedded alternatives
present  in  the  original  formulation  of  the  grammar  were 
expanded  when  semantic  actions  were  embedded  in  the 
productions.  This  is  partly  due  to  the  Polish-type  output 
we  desired,  and  partly  because  the  embedded  alternatives 
tended  to  break  up  'semantic  phrases'  —  syntactic  phrases 
with  meaning. 

Thus  measures  of  the  'goodness'  of  grammatical 
presentation  methods  must  take  more  than  just  the 
'smallness' (i.e., the number and complexity of productions,
and  the  number  of  nonterminals)  of  the  grammar  into  account. 
The  measures  must  also  reflect  the  amount  of  work  the 
compiler  writer  is  forced  to  perform  to  put  the  grammar  into 
the  form  necessary  to  give  him  the  results  he  wishes. 


Appendix A

Formats  of  Dictionary  Entries 


(1) Nonterminal dictionary entry:
    (1) string descriptor of nonterminal symbol name
    (2) set of defining production numbers
    (3) list of basis sets accessing this symbol
    (4) closure of this nonterminal
    (5) set of production numbers referencing this symbol
    (6) flags:
        (i)   This nonterminal generates the null string
        (ii)  This nonterminal generates a terminal string
        (iii) This nonterminal was created by normalization

(2) Terminal dictionary entry:
    (1) string descriptor
    (2) list of basis sets accessing this symbol
    (3) set of production numbers referencing this symbol
    (4) category of this terminal symbol
        (keyword, special character, or token class)

(3) Semantic Action dictionary entry:
    (1) string descriptor
    (2) set of production numbers referencing this symbol

(4) Production dictionary entry:
    (1) canonical order number
    (2) nonterminal dictionary entry number
        of nonterminal symbol this production defines
    (3) right hand side pointers:
        (i)  beginning position in right hand side list
        (ii) length of production (in symbols)

(5) Item Set dictionary entry:
    (1) Access Symbol, E
    (2) Basis set, B = {(p,j)}
    (3) Transition set, T = {(y,A,s)}
    (4) Reduce set, R = {(p,A,{y})}
    (5) Link field for list of item sets
        with access symbol E

where
    p is a production index;
    j is a symbol index in the right hand side
        of production p (0<j<(#p));
    A is an ordered semantic action sequence,
        to be performed upon recognizing symbol y;
    s is an item set index;
and
    {y} denotes the set of symbols y which are the lookahead
        strings for reduction by production p.


Appendix B


Other  Input  Formats 
and  Their  Transformations 


As  mentioned  in  the  discussion  of  the  transformation 
step in Section 4.1.3, three input formats are currently
allowed: BNF, WRE, and XVWN.

In this Appendix, we will discuss the BNF and XVWN
formats; WRE was discussed in Section 4.1.3.


B.1: BNF

The  BNF  format  requires  no  production  transformations 
whatsoever.  This  format  has  no  embedded  alternatives  and  no 
metaparentheses.  Nonterminal  symbols  are  enclosed  in 
angular brackets '<' and '>'; any positive number of
enclosed blanks are treated as one blank, and the first
character after the opening angular bracket '<' must not be
a blank. Terminal symbols are represented as themselves, as
are semantic action symbols. The system can distinguish
between the terminal symbols and semantic action symbols by
referring to the appropriate declaration. It is advisable
for the user, however, to start all semantic action symbols
with some particular character (such as '$' or '#'), in
order to increase the readability of the grammar.

A production is written in the normal BNF format: the
metasymbol '::=', read 'can be rewritten as', separates the
nonterminal symbol being defined from the right hand side of
the production. Blanks separate symbols, and the or-bar,
'|', signals an alternative. An unfortunate aspect of BNF
is that there is no explicit production endmarker.
Therefore, the system uses a fixed format to enable it to
delineate the beginning of a new production, and yet remain
an at-most-one-symbol lookahead system. Thus, each
nonterminal being defined by a sequence of alternatives must
start in column 1 of an input record, and the symbols on the
right hand side of a production must not use column 1. The
symbols '::=' and '|' may appear on the right hand side of a
production only as metasymbols, but not as terminal symbols.
If the symbol '<' is to appear on the right hand side of a
production, then it must be immediately followed by a blank.
We suggest that if the user has a choice, he should select
one of the other formats.




B.2:  XVWN 


XVWN is our acronym for extended Van Wijngaarden
notation. We should emphasize notation, rather than Van
Wijngaarden grammars, which are two-level grammars. We do
not use the two-level aspect of these grammars; rather, we
use the format, because it is particularly easy to read.

Nonterminal  and  semantic  action  symbols  are  identifiers, 
as  are  found  in  most  programming  languages.  Symbols  are 
separated  by  commas.  A  production  consists  of  the 
nonterminal  being  defined,  a  colon  (which  is  completely 
analogous to BNF's '::='), and the right hand side of the
production,  followed  by  a  period.  Thus  the  production's 
completion  is  explicitly  marked. 

Alternatives  on  the  right  hand  side  of  a  production  are 
separated  by  semicolons,  which  are  analogous  to  BNF's  '|'. 

Terminal symbols in VWN are lower case identifiers,
whereas nonterminal symbols are uppercase. Since relatively
few input devices recognize upper and lower case, we copy
WRE's method of representing terminal symbols: they are
either identifiers, or special character sequences, in
either case enclosed in single quotes. Any embedded single
quotes are denoted as two single quotes. Figure B.1 uses
the upper/lower case method, for readability.

XVWN  provides  parentheses  as  metasymbols,  to  allow 
grouping  and  the  formation  of  embedded  alternatives.  In 
addition  (and  here  is  where  the  'extended'  comes  in),  we 
allow  curly  brackets  and  square  brackets  as  minor  uses  of  a 
second  level  of  the  grammar.  {x}  indicates  that  the 
sequence x is optional. [x/y] indicates a list of one or
more  x's,  separated  by  y's.  Both  x  and  y  may  be  complex 
sequences  themselves.  Defined  in  XVWN,  these  extensions 
are: 

    [x/y]:  x, (y, [x/y]; ).

    [x]:    x, ([x]; ).

    {x}:    x; .




The  transformations  necessary  to  normalize  XVWN  (i.e. , 
transform  XVWN  into  BNF)  are  the  following: 


(1) embedded alternatives: these must be enclosed in
parentheses;  the  expansion  process  is  identical  to  that  used 
in  WRE. 


(2) list notation: for arbitrary sequences of symbols
C, D,

    (a) list with delimiter

            Z:  C, [a/b], D.
        is replaced by
            Z:  C, AB, D.
            AB: a; AB, b, a.

    (b) list without delimiter (sequence)

            Z:  C, [a], D.
        is replaced by
            Z:  C, A, D.
            A:  a; A, a.


(3)  optionality: 

X:  C,  {a},  D. 
is  replaced  by 

X:  C,  A,  D. 

A:  a;  . 
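
The list and optionality transformations above each introduce one new nonterminal with two alternatives. They can be sketched in Python as below; the function names and the representation of a production as (name, list of alternatives) are our own assumptions, and the name of the new nonterminal is supplied by the caller rather than generated.

```python
def expand_list(item, delimiter, name):
    """[item/delimiter]  becomes  name: item; name, delimiter, item."""
    return (name, [[item], [name, delimiter, item]])


def expand_sequence(item, name):
    """[item]  becomes  name: item; name, item."""
    return (name, [[item], [name, item]])


def expand_optional(item, name):
    """{item}  becomes  name: item; .   (the second alternative is empty)"""
    return (name, [[item], []])


ab = expand_list("a", "b", "AB")
seq = expand_sequence("a", "A")
opt = expand_optional("a", "A")
```

In each case the recursive alternative is left recursive, matching the replacement productions shown above.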


XVWN GRAMMAR: [XVWN RULE].

XVWN RULE: nonterminal symbol, colon,
           XVWN RHS, period.

XVWN RHS: [[ITEM/ comma]/ semicolon].

ITEM: nonterminal symbol;
      terminal symbol;
      left parenthesis, XVWN RHS,
          right parenthesis;
      left square bracket, XVWN RHS, {slash, XVWN RHS},
          right square bracket;
      left curly bracket, XVWN RHS,
          right curly bracket.

Figure B.1

A  Parsing  Grammar  for  XVWN  written  in  XVWN 
B.3: WRE Revisited


In our second transformation step for WRE, we chose to
substitute new nonterminal symbols for just the qi and xi
parts of a production. We could also have chosen to
substitute the new nonterminal symbols for incremental
portions of the production. For example, for the production
A -> h1(q1)x1 ... hn(qn)xn w we could substitute the
productions

1) N1 -> h1 if x1 = '*', or
   N1 -> h1 q1 if x1 = '+'

2) for 2 <= i <= n: Ni -> N(i-1) hi if xi = '*', or
   Ni -> N(i-1) hi qi if xi = '+'

3) for 1 <= i <= n: Ni -> Ni qi

4) A -> Nn w

This latter method has the advantage that quite often
w = e, and we may save a nonterminal in a slight
optimization which rids us of the need for Nn. The
corresponding saving in the previous algorithm would
require n = 1 and h1 = w = e. The advantages of the
previous algorithm include the fact that it is simpler to
recreate the original grammar form, and in some cases we may
reuse the new nonterminals in other transformations, if some
of the qi of one production are identical to some qi in a
production already transformed.
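The incremental substitution in steps 1) to 4) can be sketched in Python as follows. The function name `incremental` and the encoding of a production as a list of `(h, q, x)` triples plus a tail `w` are assumptions made for this illustration; only the names `N1 ... Nn` come from the text.

```python
def incremental(lhs, parts, w):
    """Expand  lhs -> h1 (q1) x1 ... hn (qn) xn w  into BNF productions.

    parts is a list of (h, q, x) triples: h and q are lists of symbols and
    x is '*' or '+'. w is the list of symbols after the last iteration.
    Returns a list of (left side, right side) pairs.
    """
    rules = []
    prev = []                       # empty for i = 1, else [N(i-1)]
    for i, (h, q, x) in enumerate(parts, start=1):
        n = f"N{i}"
        # Steps 1) and 2): Ni -> N(i-1) hi, with one qi appended for '+'.
        rules.append((n, prev + list(h) + (list(q) if x == "+" else [])))
        # Step 3): Ni -> Ni qi absorbs each further repetition of qi.
        rules.append((n, [n] + list(q)))
        prev = [n]
    # Step 4): A -> Nn w.
    rules.append((lhs, prev + list(w)))
    return rules


# A -> a (b)* c (d)+ e
for left, right in incremental("A", [(["a"], ["b"], "*"),
                                     (["c"], ["d"], "+")], ["e"]):
    print(left, "->", " ".join(right))
```

When `w` is empty the last production is just `A -> Nn`, which is the case where the text notes that `Nn` can be dispensed with.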
Appendix C
Left versus Right


Our method of transforming regular expressions involves
the substitution of left recursion for regular expression
iteration. We use left recursion rather than right to allow
the SLR method to be effective more often in resolving any
conflicts in the possibly inadequate item set, after our
transformation. For example, the regular expression
production

A -> ( x )+ x y

can be transformed into the sequence

A -> B x y
B -> x
B -> B x

using left recursion; we note that the resulting parse
tables are SLR(1). However, if we transform A -> ( x )+ x y
into

A -> B x y
B -> x
B -> x B

using right recursion, the resulting parse
tables are not even LALR(1), much less SLR(1).

Left  recursion  also  requires  less  parse  stack  space  than 
does  right  recursion. 
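A rough way to see the stack-space claim is to simulate a shift-reduce recognizer by hand on n copies of x, using only the productions for B. The sketch below is an illustration under stated assumptions: the function names are invented here, n is at least 1, and the reductions are hard-coded rather than driven by parse tables.

```python
def max_stack_left(n):
    """B -> x | B x : a handle is on top of the stack after every shift."""
    stack, deepest = [], 0
    for _ in range(n):
        stack.append("x")                    # shift
        deepest = max(deepest, len(stack))
        if stack == ["x"]:
            stack = ["B"]                    # reduce B -> x
        else:
            stack[-2:] = ["B"]               # reduce B -> B x
    return deepest


def max_stack_right(n):
    """B -> x | x B : no reduction is possible until the last x is shifted."""
    stack = ["x"] * n                        # shift all n symbols
    deepest = len(stack)
    stack[-1:] = ["B"]                       # reduce B -> x
    while len(stack) > 1:
        stack[-2:] = ["B"]                   # reduce B -> x B
    return deepest


print(max_stack_left(5), max_stack_right(5))
```

With left recursion the stack depth stays bounded regardless of n, while with right recursion it grows with the length of the list.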

DeRemer's new system uses right recursion, in keeping
with the LR philosophy that the parser cannot assume the
original production is being recognized until the production
is completed [DeRemer 1975]. His system handles problems
similar to the above by utilizing a smarter transformation
algorithm than we use.
Appendix D

A Grammar of Constructor Input


Note: this listing was produced by the constructor
system.

declare nonterminal ( bnf_alternative, bnf_grammar,
    bnf_rule, declaration, grammar, option,
    program_input, wre_alternative, wre_grammar,
    xvwn_grammar );

declare terminal ( identifier, lhs_nonterminal,
    nonterminal_symbol, rhs_nonterminal,
    terminal_symbol );

declare semantic action ( #begin_grammar, #end_grammar,
    #exdent, #indent, #list_first, #list_first,
    #list_follow, #list_input, #list_lr0, #list_null,
    #list_schema, #list_transform, #list_xref,
    #new_line, #new_page, #reset_dict, #reset_tab,
    #save_in_rhs, #set_bnf, #set_flag_false, #set_goal,
    #set_new_nt, #set_nt, #set_sa, #set_t, #set_tab,
    #set_wre, #set_xvwn, #start_alt, #store_id, #tables,
    #use_lalr, #use_slr );

goal symbol is program_input;

grammar is wre;

program_input = ( ( declaration )+
                  ( , 'options' '='
                      '(' option ( ',' option )* ')' ';' )
                  grammar )+ ;

declaration = 'declare' ( 'nonterminal' #set_nt ,
                          'semantic' 'action' #set_sa ,
                          'terminal' #set_t ) #indent
              '(' identifier #store_id
              ( ',' identifier #store_id )* ')' ';'
              #exdent #new_line ;


option = '-' #set_flag_false option ,
         'echo'      #list_input ,      /* input option */
         'transform' #list_transform ,  /* analysis options */
         'xref'      #list_xref ,
         'first'     #list_first ,
         'follow'    #list_follow ,
         'lr0'       #list_lr0 ,        /* constructor options */
         'lalr'      #use_lalr ,
         'null'      #list_null ,
         'schema'    #list_schema ,
         'slr'       #use_slr ,
         'tables'    #tables ;

grammar = 'goal' 'symbol' 'is' identifier #set_goal ';' #new_line
          'grammar' 'is' ( 'wre' ';' #set_wre #begin_grammar
                               wre_grammar ,
                           'xvwn' ';' #set_xvwn #begin_grammar
                               xvwn_grammar ,
                           'bnf' ';' #set_bnf #begin_grammar
                               bnf_grammar )
          #end_grammar ;

bnf_grammar = ( bnf_rule )+ ;

bnf_rule = lhs_nonterminal
           bnf_alternative ( '|' bnf_alternative )* ;

bnf_alternative = ( rhs_nonterminal , terminal_symbol )* ;

wre_grammar = wre_rule ( ';' wre_rule )* '.' ;

wre_rule = nonterminal_symbol '='
           wre_alternative ( ',' wre_alternative )* ;

wre_alternative = ( nonterminal_symbol ,
                    terminal_symbol ,
                    '(' wre_alternative
                        ( ',' wre_alternative )* ')'
                        ( '+' , '*' , '?' )?
                  )* ;

xvwn_grammar = ( xvwn_rule )+ ;

xvwn_rule = nonterminal_symbol ':' xvwn_right_hand_side '.' ;

xvwn_right_hand_side = item ( ',' item )*
                       ( ';' item ( ',' item )* )* ;

item = nonterminal_symbol ,
       terminal_symbol ,
       '(' xvwn_right_hand_side ')' ,
       '[' item ( '/' item )? ']' ,
       '{' xvwn_right_hand_side '}' ;
Appendix E

A Grammar of the SUE.8 Language


Note: this listing was produced by the constructor
system.

declare nonterminal ( absolute_definition, alternative,
    assignment_statement, case_label,
    compound_statement, conditional_phrase,
    context_block, context_declaration, data_block,
    definition, eol, exit_label, exit_statement,
    expression, full_definition, id_list, id_type_list,
    indented_statements, index_type, inline_statement,
    invocation, macro_body, open_statement, primary,
    primary_expression, procedure, procedure_decl,
    program_block, program_body, quadrary, quintary,
    record, reference, return_statement, secondary,
    selection_statement, sextary, statement, template,
    tertiary, tuple, type, variant_subrecord );

declare terminal ( identifier, literal );

declare semantic action ( #absolute, #accepts, #add, #addr,
    #and, #array, #assign, #assignment_target, #at,
    #begin, #bit, #boolean, #case, #case_else,
    #case_label, #character, #chr, #class_en, #class_eq,
    #class_i, #class_ls, #class_s, #constant, #context,
    #cycle, #data, #decimal, #definition, #div, #do,
    #dot, #empty, #end_alternative, #end_block,
    #end_case, #end_compound, #end_expression,
    #end_id_list, #end_if, #end_macro,
    #end_procedure_type, #end_record, #end_statement,
    #end_tuple, #end_variant, #enumeration, #equal,
    #exdent, #exit, #exit_label, #expression_macro,
    #false, #fast, #full, #greater, #greater_eq, #high,
    #if, #if_else, #if_then, #in, #indent, #inline,
    #integer, #less, #less_eq, #low, #macro, #max, #min,
    #mod, #mult, #negate, #new_line, #not, #not_equal,
    #null, #of, #open, #or, #ord, #paren_type, #pointer,
    #powerset, #pred, #procedure, #program, #record,
    #ref, #reference_macro, #return, #returns,
    #singleton, #single_enumeration, #size,
    #string_name, #subtract, #succ, #to, #true, #tuple,
    #type, #typed, #unless, #variable, #variant, #when,
    #with );

goal symbol is procedure;

grammar is wre;
procedure = data_block
            ( context_block , procedure )*
            program_block ;

data_block = 'data' #data identifier #string_name
             ( , 'accepts' #accepts '(' id_list ')' )
             ( , 'returns' #returns '(' identifier ')' )
             ';' #indent #new_line
             ( definition , invocation )*
             #end_block #exdent #new_line ;

id_list = identifier ( ',' identifier )* #end_id_list ;

definition = 'declare' #definition
             ( id_type_list ,
               procedure_decl '(' identifier ')' )
             eol ,
             template ;

id_type_list = #variable type '(' id_list ')'
               ( ',' #variable type '(' id_list ')' )* ;

type = index_type ,
       'boolean' #boolean ,
       'array' #array '(' index_type ')' 'of' #of type ,
       'pointer' #pointer 'to' identifier ,
       'powerset' #powerset 'of' index_type ;

index_type = 'integer' #integer ,
             'bit' #bit '(' expression ')' ,
             '(' #paren_type expression 'to' expression ')' ,
             identifier ,
             '(' #paren_type identifier
                 ( ',' #enumeration id_list ')' ,
                   ')' #single_enumeration ) ,
             'character' #character '(' expression ')' ,
             'fast' index_type #fast ;

expression = sextary #end_expression ;

sextary = quintary ( '!' #class_ls quintary #or )* ;

quintary = quadrary ( '&' #class_ls quadrary #and )* ;

quadrary = tertiary
           ( '=' #class_eq tertiary #equal ,
             '<' #class_en tertiary #less ,
             '>' #class_en tertiary #greater ,
             '<=' #class_en tertiary #less_eq ,
             '>=' #class_en tertiary #greater_eq ,
             ' #class_eq tertiary #not_equal ,
             'in' #class_s tertiary #in )* ;

tertiary = secondary
           ( '+' #class_i secondary #add ,
             '-' #class_i secondary #subtract )* ;
secondary = primary
            ( '*' #class_i primary #mult ,
              '/' #class_i primary #div ,
              'mod' #class_i primary #mod )* ;

primary = reference ,
          primary_expression ;

reference = ( identifier ,
              'typed' #typed type '(' reference ')' )
            ( '@' #at ,
              '.' #dot identifier ,
              tuple )* ;

tuple = '(' #tuple expression
        ( ',' expression )* ')' #end_tuple ;

primary_expression = '(' sextary ')' ,
                     '-' primary #negate ,
                     primary #not ,
                     literal ,
                     'addr' '(' #ref reference ')' #addr ,
                     'chr' '(' expression ')' #chr ,
                     'empty' '(' identifier ')' #empty ,
                     'false' #false ,
                     'full' '(' identifier ')' #full ,
                     'high' '(' index_type ')' #high ,
                     'low' '(' index_type ')' #low ,
                     'max' '(' sextary ',' sextary ')' #max ,
                     'min' '(' sextary ',' sextary ')' #min ,
                     'null' '(' identifier ')' #null ,
                     'ord' '(' sextary ')' #ord ,
                     'pred' '(' sextary ')' #pred ,
                     'singleton' '(' sextary ')' #singleton ,
                     'size' '(' #ref reference ')' #size ,
                     'succ' '(' sextary ')' #succ ,
                     'true' #true ;

procedure_decl = 'procedure' #procedure
                 ( , 'accepts' #accepts
                     '(' type ( ',' type )* ')' )
                 ( , 'returns' #returns '(' type ')' )
                 #end_procedure_type ;

eol = ';' #end_statement #new_line ;

template = #definition
           ( 'constant' #constant identifier '='
                 expression eol ,
             'type' #type identifier '='
                 ( type , record ) eol ,
             'macro' #macro identifier
                 ( , '(' id_list ')' ) ';'
                 #indent #new_line macro_body
                 #exdent 'end' 'macro' eol ) ;
record = 'record' #record #indent #new_line
         ( id_type_list ( , ',' variant_subrecord ) ,
           variant_subrecord )
         #exdent #new_line 'end' #end_record ;

variant_subrecord = 'case' #variant index_type
                    'tag' identifier ';'
                    #indent #new_line
                    alternative
                    ( ';' #end_alternative alternative )*
                    #exdent #new_line 'end'
                    #end_alternative #end_variant ;

alternative = ( case_label ':' )+ id_type_list ;

case_label = ( primary_expression , identifier )
             ( ,
               'to' #to ( primary_expression , identifier ) )
             #case_label ;

macro_body = ( invocation )*
             ( ,
               full_definition
                   ( full_definition , invocation )* ,
               statement ( statement , invocation )* ) ,
             reference
                 #new_line #end_macro #reference_macro ,
             '(' expression ')'
                 #new_line #end_macro #expression_macro ;

invocation = reference eol ;

full_definition = definition , absolute_definition ;

absolute_definition = 'absolute' #definition #absolute
                      '(' expression ')'
                      type identifier eol ;

statement = assignment_statement ,
            compound_statement ,
            exit_statement ,
            inline_statement ,
            return_statement ,
            selection_statement ,
            eol ;

assignment_statement = reference ':=' #assignment_target
                       expression #assign eol ;
compound_statement = exit_label
                     ( 'do' #do identifier ':='
                           expression 'to' expression ';'
                           #indent #new_line
                           ( statement , invocation )*
                           #exdent 'end' #end_compound ,
                       'cycle' #cycle #indent #new_line
                           ( statement , invocation )*
                           #exdent 'end' #end_compound ,
                       'begin' #begin #indent #new_line
                           program_body
                           #exdent 'end' #end_compound )
                     exit_label eol ;

exit_label = ,
             '<' #exit_label identifier '>' ;

program_body = ( definition , invocation )*
               ( ,
                 ( open_statement )+
                     ( statement , invocation )* ,
                 statement ( statement , invocation )* ) ;

open_statement = 'open' #open reference eol ;

exit_statement = 'exit' #exit exit_label
                 conditional_phrase eol ;

conditional_phrase = ,
                     'when' #when expression ,
                     'unless' #unless expression ;

inline_statement = 'inline' #inline '(' expression
                   ( expression )* ')' eol ;

return_statement = 'return' #return conditional_phrase
                   ( ,
                     'with' #with expression )
                   eol ;
selection_statement =
        'if' #if expression
            #indent #new_line
            'then' #if_then
            indented_statements
            ( ,
              'else' #if_else
              indented_statements )
            #exdent 'end' #end_if eol ,
        'case' #case index_type 'tag'
            expression ';'
            #indent #new_line
            ( ( case_label ':' )+
              indented_statements )*
            ( ,
              'else' #case_else ':'
              indented_statements )
            #exdent 'end' #end_case eol ;

indented_statements = ( statement , invocation ) #indent
                      ( statement , invocation )* #exdent ;

program_block = 'program' #program identifier ';'
                #indent #new_line program_body
                #end_block #exdent #new_line ;

context_block = 'context' #context identifier ';'
                #indent #new_line
                ( context_declaration )+ 'end'
                #end_block #exdent #new_line ;

context_declaration = template ,
                      absolute_definition ,
                      invocation ,
Appendix F

An Example of the Canonical Construction Method


In this appendix, we present examples of the algorithms
discussed in Chapter 2. We will construct the LR(1) parse
tables for the arithmetic expression grammar

G = <Vn = {E,T,F},
     Vt = {+,*,(,),a},
     E ∈ Vn,
     P>, where P is the set of productions

1) E -> E+T
2)    | T
3) T -> T*F
4)    | F
5) F -> (E)
6)    | a

The construction process begins by augmenting the
grammar G with the new production

0) S -> E

and making S the new start symbol.

The initial item set is

0: S -> •E , e

We then begin to complete the item set, by first computing
its closure: since [S -> •E,e] is in the item set, we add
[E -> •E+T,e] and [E -> •T,e]; the 'dot' before the E of
[S -> •E,e] tells us to add all defining productions of E.

We next use [E -> •E+T,e] to add [E -> •E+T,+] and
[E -> •T,+] to the item set: the lookahead string '+' is the
sole member of e_free_FIRST(+T). We use a shorthand to
represent several items differing only in their lookahead
strings: [E -> •E+T,{e,+}] and [E -> •T,{e,+}]. The next
item we examine, [E -> •T,e], causes the addition of the
items [T -> •T*F,e] and [T -> •F,e]. Continuing in this
manner, we obtain the following item set:


0:  S -> •E    , {e}
    E -> •E+T  , {e,+}
    E -> •T    , {e,+}
    T -> •T*F  , {e,+,*}
    T -> •F    , {e,+,*}
    F -> •(E)  , {e,+,*}
    F -> •a    , {e,+,*}
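The completion of item set 0 can be checked with a small closure routine. The sketch below is an illustration, not the constructor's implementation: the item encoding `(production number, dot position, lookahead)`, the function names, and the use of the string `"e"` for the empty lookahead are assumptions, and the FIRST computation relies on this grammar having no empty productions.

```python
# Productions 0..6 of the augmented grammar, numbered as in the text.
GRAMMAR = [("S", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
           ("T", ("T", "*", "F")), ("T", ("F",)),
           ("F", ("(", "E", ")")), ("F", ("a",))]
NONTERMINALS = {"S", "E", "T", "F"}


def first_of(symbol, seen=frozenset()):
    """Terminals that can begin a string derived from symbol."""
    if symbol not in NONTERMINALS:
        return {symbol}
    if symbol in seen:                  # already being expanded
        return set()
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol:
            result |= first_of(rhs[0], seen | {symbol})
    return result


def closure(items):
    """Complete a set of (production, dot, lookahead) items."""
    items = set(items)
    work = list(items)
    while work:
        prod, dot, la = work.pop()
        rhs = GRAMMAR[prod][1]
        if dot == len(rhs) or rhs[dot] not in NONTERMINALS:
            continue
        rest = rhs[dot + 1:]
        # Lookaheads for the new items: FIRST of what follows the
        # nonterminal, or the item's own lookahead if nothing follows.
        lookaheads = first_of(rest[0]) if rest else {la}
        for i, (lhs, _) in enumerate(GRAMMAR):
            if lhs == rhs[dot]:
                for b in lookaheads:
                    if (i, 0, b) not in items:
                        items.add((i, 0, b))
                        work.append((i, 0, b))
    return items


def grouped(items):
    """Collect lookaheads per (production, dot), as in the shorthand."""
    table = {}
    for prod, dot, la in items:
        table.setdefault((prod, dot), set()).add(la)
    return table


for (prod, dot), las in sorted(grouped(closure({(0, 0, "e")})).items()):
    lhs, rhs = GRAMMAR[prod]
    print(lhs, "->", "".join(rhs[:dot]) + "•" + "".join(rhs[dot:]),
          ",", sorted(las))
```

Grouping the result by production and dot position gives the same seven entries as item set 0 above.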
Next, we compute the successors of item set 0: for each
item with a dot in front of X, for each X ∈ V, we create a
new, incomplete item set, containing items with the dot
after the X. From item set 0 we obtain the item sets

1:  S -> E•    , {e}
    E -> E•+T  , {e,+}

2:  E -> T•    , {e,+}
    T -> T•*F  , {e,+,*}

3:  T -> F•    , {e,+,*}

4:  F -> (•E)  , {e,+,*}

5:  F -> a•    , {e,+,*}

We will then proceed to perform the completion operation
on these new item sets, in turn. The full LR(1) item set
collection is shown below:


 0:  S -> •E    , {e}
     E -> •E+T  , {e,+}
     E -> •T    , {e,+}
     T -> •T*F  , {e,+,*}
     T -> •F    , {e,+,*}
     F -> •(E)  , {e,+,*}
     F -> •a    , {e,+,*}

 1:  S -> E•    , {e}
     E -> E•+T  , {e,+}

 2:  E -> T•    , {e,+}
     T -> T•*F  , {e,+,*}

 3:  T -> F•    , {e,+,*}

 4:  F -> (•E)  , {e,+,*}
     E -> •E+T  , {),+}
     E -> •T    , {),+}
     T -> •T*F  , {),+,*}
     T -> •F    , {),+,*}
     F -> •(E)  , {),+,*}
     F -> •a    , {),+,*}

 5:  F -> a•    , {e,+,*}

 6:  E -> E+•T  , {e,+}
     T -> •T*F  , {e,+,*}
     T -> •F    , {e,+,*}
     F -> •(E)  , {e,+,*}
     F -> •a    , {e,+,*}

 7:  T -> T*•F  , {e,+,*}
     F -> •(E)  , {e,+,*}
     F -> •a    , {e,+,*}

 8:  F -> (E•)  , {e,+,*}
     E -> E•+T  , {),+}

 9:  E -> T•    , {),+}
     T -> T•*F  , {),+,*}

10:  T -> F•    , {),+,*}

11:  F -> (•E)  , {),+,*}
     E -> •E+T  , {),+}
     E -> •T    , {),+}
     T -> •T*F  , {),+,*}
     T -> •F    , {),+,*}
     F -> •(E)  , {),+,*}
     F -> •a    , {),+,*}

12:  F -> a•    , {),+,*}

13:  E -> E+T•  , {e,+}
     T -> T•*F  , {e,+,*}

14:  T -> T*F•  , {e,+,*}

15:  F -> (E)•  , {e,+,*}

16:  E -> E+•T  , {),+}
     T -> •T*F  , {),+,*}
     T -> •F    , {),+,*}
     F -> •(E)  , {),+,*}
     F -> •a    , {),+,*}

17:  T -> T*•F  , {),+,*}
     F -> •(E)  , {),+,*}
     F -> •a    , {),+,*}

18:  F -> (E•)  , {),+,*}
     E -> E•+T  , {),+}

19:  E -> E+T•  , {),+}
     T -> T•*F  , {),+,*}

20:  T -> T*F•  , {),+,*}

21:  F -> (E)•  , {),+,*}
Now that the item sets are constructed, we can construct
the parsing tables from them. For example, since item set 0
contains the item [F -> •a,{e,+,*}], we set f0(a) = shift.
Likewise, since item set 0 also contains [F -> •(E),{e,+,*}],
we set f0(() = shift. All values of f0 for other input
symbols will be error. When we 'move the dot across' an E
in item set 0, we then have moved to item set 1: therefore
we set g0(E) = 1. Likewise, g0(a) = 5. The complete tables
are given below. We have used X to indicate an error
action, S a shift action, Ri for reduce i, and A for accept.
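The whole construction, from the initial item set to the f and g functions, can be sketched as below. This is a minimal illustration under stated assumptions, not the constructor described in this thesis: the names `closure`, `build`, `f` and `g`, the dictionary encoding of item sets, and the string `"e"` for the empty lookahead are choices made for the example, FIRST is computed assuming no empty productions, and conflicts are not detected (the grammar here is LR(1), so none arise).

```python
GRAMMAR = [("S", ("E",)), ("E", ("E", "+", "T")), ("E", ("T",)),
           ("T", ("T", "*", "F")), ("T", ("F",)),
           ("F", ("(", "E", ")")), ("F", ("a",))]
NT = {"S", "E", "T", "F"}


def first_of(symbol, seen=frozenset()):
    """Terminals that can begin a string derived from symbol."""
    if symbol not in NT:
        return {symbol}
    if symbol in seen:
        return set()
    result = set()
    for lhs, rhs in GRAMMAR:
        if lhs == symbol:
            result |= first_of(rhs[0], seen | {symbol})
    return result


def closure(kernel):
    """kernel maps (production, dot) to a lookahead set; returns the
    completed item set in the same form, in order of discovery."""
    items = {core: set(las) for core, las in kernel.items()}
    changed = True
    while changed:
        changed = False
        for (prod, dot), las in list(items.items()):
            rhs = GRAMMAR[prod][1]
            if dot == len(rhs) or rhs[dot] not in NT:
                continue
            rest = rhs[dot + 1:]
            new = first_of(rest[0]) if rest else set(las)
            for i, (lhs, _) in enumerate(GRAMMAR):
                if lhs == rhs[dot]:
                    target = items.setdefault((i, 0), set())
                    if not new <= target:
                        target |= new
                        changed = True
    return items


def freeze(items):
    return frozenset((core, frozenset(las)) for core, las in items.items())


def build():
    """Return (item sets, f, g); f[n] maps a lookahead to an action and
    g[n] maps a grammar symbol to the successor item set number."""
    states = [closure({(0, 0): {"e"}})]
    index = {freeze(states[0]): 0}
    f, g = [], []
    n = 0
    while n < len(states):              # states grows as we go
        items, f_row, g_row = states[n], {}, {}
        for (prod, dot), las in items.items():
            rhs = GRAMMAR[prod][1]
            if dot == len(rhs):         # completed item: reduce or accept
                for la in las:
                    f_row[la] = "accept" if prod == 0 else f"reduce {prod}"
            elif rhs[dot] not in g_row:
                x = rhs[dot]            # move the dot across x
                kernel = {(p, d + 1): l for (p, d), l in items.items()
                          if d < len(GRAMMAR[p][1]) and GRAMMAR[p][1][d] == x}
                succ = closure(kernel)
                if freeze(succ) not in index:
                    index[freeze(succ)] = len(states)
                    states.append(succ)
                g_row[x] = index[freeze(succ)]
                if x not in NT:
                    f_row[x] = "shift"
        f.append(f_row)
        g.append(g_row)
        n += 1
    return states, f, g


states, f, g = build()
print(len(states), f[0], g[0])
```

Entries absent from a row correspond to the error action. Because successors are explored in the order in which items appear in each set, the item set numbers produced by this sketch coincide with those used in the text.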
                f                |               g
       +   *   (   )   a   e     |   E   T   F   +   *   (   )   a

  0    X   X   S   X   S   X     |   1   2   3   X   X   4   X   5
  1    S   X   X   X   X   A     |   X   X   X   6   X   X   X   X
  2    R2  S   X   X   X   R2    |   X   X   X   X   7   X   X   X
  3    R4  R4  X   X   X   R4    |   X   X   X   X   X   X   X   X
  4    X   X   S   X   S   X     |   8   9   10  X   X   11  X   12
  5    R6  R6  X   X   X   R6    |   X   X   X   X   X   X   X   X
  6    X   X   S   X   S   X     |   X   13  3   X   X   4   X   5
  7    X   X   S   X   S   X     |   X   X   14  X   X   4   X   5
  8    S   X   X   S   X   X     |   X   X   X   16  X   X   15  X
  9    R2  S   X   R2  X   X     |   X   X   X   X   17  X   X   X
 10    R4  R4  X   R4  X   X     |   X   X   X   X   X   X   X   X
 11    X   X   S   X   S   X     |   18  9   10  X   X   11  X   12
 12    R6  R6  X   R6  X   X     |   X   X   X   X   X   X   X   X
 13    R1  S   X   X   X   R1    |   X   X   X   X   7   X   X   X
 14    R3  R3  X   X   X   R3    |   X   X   X   X   X   X   X   X
 15    R5  R5  X   X   X   R5    |   X   X   X   X   X   X   X   X
 16    X   X   S   X   S   X     |   X   19  10  X   X   11  X   12
 17    X   X   S   X   S   X     |   X   X   20  X   X   11  X   12
 18    S   X   X   S   X   X     |   X   X   X   16  X   X   21  X
 19    R1  S   X   R1  X   X     |   X   X   X   X   17  X   X   X
 20    R3  R3  X   R3  X   X     |   X   X   X   X   X   X   X   X
 21    R5  R5  X   R5  X   X     |   X   X   X   X   X   X   X   X

We refer the reader to the LR(0) and SLR(1) tables for the
same grammar shown in Figures 4.3.1c-d.

We can use the tables to parse the input string '(a+a)' as
follows: we begin in table 0. f0(() = shift, so we remove
the ( from the input string, and stack g0((), which is
table 4. In table 4, we consult f4(a), which is shift.
g4(a) = 12. The next parsing step consults
f12(+) = reduce 6. This action removes one table from the
stack (there is one symbol in production 6). We have table
#4 on top of the stack. Since we have just recognized the
nonterminal symbol F, we consult g4(F), which is table 10.
f10(+) = reduce 4. Again we pop one element from the stack
and consult the g-function of the table on the top of the
stack, this time with argument T, since we have just
recognized the nonterminal symbol T. g4(T) = table 9. We
continue on in this manner. A history of the parse is shown
below:




Parse    Stack            Input     Action
Step #   Contents

   1     0                (a+a)     shift
   2     0,4              a+a)      shift
   3     0,4,12           +a)       reduce 6
   4     0,4,10           +a)       reduce 4
   5     0,4,9            +a)       reduce 2
   6     0,4,8            +a)       shift
   7     0,4,8,16         a)        shift
   8     0,4,8,16,12      )         reduce 6
   9     0,4,8,16,10      )         reduce 4
  10     0,4,8,16,19      )         reduce 1
  11     0,4,8            )         shift
  12     0,4,8,15         e         reduce 5
  13     0,3              e         reduce 4
  14     0,2              e         reduce 2
  15     0,1              e         accept

Likewise, the illegal input string '(a+)' would have the
following parse history:

Parse    Stack            Input     Action
Step #   Contents

   1     0                (a+)      shift
   2     0,4              a+)       shift
   3     0,4,12           +)        reduce 6
   4     0,4,10           +)        reduce 4
   5     0,4,9            +)        reduce 2
   6     0,4,8            +)        shift
   7     0,4,8,16         )         error
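The two parse histories can be reproduced with a small table-driven parser. The sketch below is an illustration only: the f and g tables are transcribed from the tables of this appendix into a text block, while the names `TABLES`, `load` and `parse`, the one-character-per-token input, and the use of `"e"` as the end-of-input marker are assumptions of the example.

```python
# One row per table: state, f(+ * ( ) a e), then g(E T F + * ( ) a).
TABLES = """
0  X  X  S  X  S  X   1  2  3  X  X  4  X  5
1  S  X  X  X  X  A   X  X  X  6  X  X  X  X
2  R2 S  X  X  X  R2  X  X  X  X  7  X  X  X
3  R4 R4 X  X  X  R4  X  X  X  X  X  X  X  X
4  X  X  S  X  S  X   8  9  10 X  X  11 X  12
5  R6 R6 X  X  X  R6  X  X  X  X  X  X  X  X
6  X  X  S  X  S  X   X  13 3  X  X  4  X  5
7  X  X  S  X  S  X   X  X  14 X  X  4  X  5
8  S  X  X  S  X  X   X  X  X  16 X  X  15 X
9  R2 S  X  R2 X  X   X  X  X  X  17 X  X  X
10 R4 R4 X  R4 X  X   X  X  X  X  X  X  X  X
11 X  X  S  X  S  X   18 9  10 X  X  11 X  12
12 R6 R6 X  R6 X  X   X  X  X  X  X  X  X  X
13 R1 S  X  X  X  R1  X  X  X  X  7  X  X  X
14 R3 R3 X  X  X  R3  X  X  X  X  X  X  X  X
15 R5 R5 X  X  X  R5  X  X  X  X  X  X  X  X
16 X  X  S  X  S  X   X  19 10 X  X  11 X  12
17 X  X  S  X  S  X   X  X  20 X  X  11 X  12
18 S  X  X  S  X  X   X  X  X  16 X  X  21 X
19 R1 S  X  R1 X  X   X  X  X  X  17 X  X  X
20 R3 R3 X  R3 X  X   X  X  X  X  X  X  X  X
21 R5 R5 X  R5 X  X   X  X  X  X  X  X  X  X
"""
F_COLS = ["+", "*", "(", ")", "a", "e"]
G_COLS = ["E", "T", "F", "+", "*", "(", ")", "a"]
# Production number -> (left side, length of right side).
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}


def load(text):
    f, g = [], []
    for line in text.strip().splitlines():
        cells = line.split()[1:]            # drop the state number
        f.append(dict(zip(F_COLS, cells[:6])))
        g.append(dict(zip(G_COLS, cells[6:])))
    return f, g


def parse(text):
    """Return (accepted, history); each history entry is
    (stack contents, remaining input, action)."""
    f, g = load(TABLES)
    stack, inp, history = [0], list(text) + ["e"], []
    while True:
        action = f[stack[-1]].get(inp[0], "X")
        history.append((tuple(stack), "".join(inp), action))
        if action == "S":
            # Shift: consume one symbol and stack the successor table.
            stack.append(int(g[stack[-1]][inp.pop(0)]))
        elif action.startswith("R"):
            # Reduce: pop one table per right-side symbol, then consult
            # g of the exposed table with the left-side nonterminal.
            lhs, length = PRODUCTIONS[int(action[1:])]
            del stack[-length:]
            stack.append(int(g[stack[-1]][lhs]))
        else:
            return action == "A", history


for step, (stack, rest, action) in enumerate(parse("(a+a)")[1], start=1):
    print(step, ",".join(map(str, stack)), rest, action)
```

In the printed history the remaining input carries the end marker `e` explicitly, so step 1 shows `(a+a)e` where the table above shows `(a+a)`.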

CSRG-79  THE DESIGN AND IMPLEMENTATION OF AN ADVANCED LALR
PARSE TABLE CONSTRUCTOR
David H. Thompson, April 1977 [M.Sc. Thesis, DCS, 1976]

